Real Estate¶
Model Set Up¶
Determine the price of houses by their features.
The market historical data set of real estate valuation are collected
from Sindian Dist., New Taipei City, Taiwan.
Link to the dataset
Dataset original source
import pandas as pd
data = pd.read_csv(r'data/data.csv')
data.drop(columns = ['No'], inplace = True)
data.columns = [' '.join(col.split(' ')[1:]) for col in data.columns]
data.rename(columns = {'distance to the nearest MRT station': 'MRT station distance',
'number of convenience stores': 'stores number',
'house price of unit area': 'house price'},
inplace = True)
Data exploration:
import matplotlib.pyplot as plt
pd.plotting.scatter_matrix(frame = data, figsize = (10, 10))
plt.tight_layout()
plt.show()
The scatterplot shows no strong correlation among regressors.
There are two ouliers in house price, one under 8 and the other over
115, that do not follow the rest of distributions. For this reason, the
outlier are removed.
house price and MRT station distance are not normally distributed:
data are skewed toward high values. For this reason, these columns are
transformed to log-scale:
import numpy as np
data = data[(data['house price'] > 8) & (data['house price'] < 115)]
data['log house price'] = np.log(data['house price'])
data['log MRT station distance'] = np.log(data['MRT station distance'])
Set up a linear regression model, considering transaction date,
house age, log MRT station distance, stores number, latitude and
longitude as the regressors and log house price as the response
variable.
Using non-informative priors for regressors and variance:
from baypy.model import LinearModel
import baypy as bp
model = LinearModel()
model.data = data
model.response_variable = 'log house price'
model.priors = {'intercept': {'mean': 0, 'variance': 1e6},
'transaction date': {'mean': 0, 'variance': 1e6},
'house age': {'mean': 0, 'variance': 1e6},
'log MRT station distance': {'mean': 0, 'variance': 1e6},
'stores number': {'mean': 0, 'variance': 1e6},
'latitude': {'mean': 0, 'variance': 1e6},
'longitude': {'mean': 0, 'variance': 1e6},
'variance': {'shape': 1, 'scale': 1e-6}}
Sampling¶
Run the regression sampling on 3 Markov chains, with 1000 iterations per each chain and discarding the first 50 burn-in draws:
from baypy.regression import LinearRegression
LinearRegression.sample(model = model, n_iterations = 1000,
burn_in_iterations = 50, n_chains = 3, seed = 137)
Convergence Diagnostics¶
Asses the model convergence diagnostics:
bp.diagnostics.effective_sample_size(posteriors = model.posteriors)
intercept transaction date house age log MRT station distance stores number latitude longitude variance
Effective Sample Size 2767.17 2833.16 2548.86 2877.61 2630.62 2770.24 2753.23 2778.72
bp.diagnostics.autocorrelation_summary(posteriors = model.posteriors)
intercept transaction date house age log MRT station distance stores number latitude longitude variance
Lag 0 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
Lag 1 -0.009739 -0.000259 0.000069 -0.027052 0.001308 0.009574 -0.028492 0.034033
Lag 5 -0.004870 -0.010960 -0.017678 0.010558 0.003372 -0.004647 -0.003221 -0.029635
Lag 10 0.014359 0.003361 0.009231 0.013320 -0.012113 -0.017253 0.016727 -0.000938
Lag 30 -0.000886 0.031398 -0.030163 -0.027021 0.004524 0.000075 -0.034411 -0.043168
bp.diagnostics.autocorrelation_plot(posteriors = model.posteriors)
All diagnostics show a low correlation, indicating the chains converged to the stationary distribution.
Posteriors Analysis¶
Asses posterior analysis:
bp.analysis.trace_plot(posteriors = model.posteriors)
Traces are good, incidating draws from the stationary distribution.
bp.analysis.residuals_plot(model = model)
Also the residuals plot is good: no evidence for patterns, shapes or outliers.
bp.analysis.summary(posteriors = model.posteriors)
Number of chains: 3
Sample size per chian: 1000
Empirical mean, standard deviation, 95% HPD interval for each variable:
Mean SD HPD min HPD max
intercept -914.144001 114.532293 -1140.625297 -689.444488
transaction date 0.165923 0.032401 0.102919 0.227775
house age -0.006024 0.000831 -0.007647 -0.004449
log MRT station distance -0.166081 0.013939 -0.192367 -0.137433
stores number 0.014637 0.004559 0.005655 0.023409
latitude 9.593946 0.851219 7.979301 11.251111
longitude 2.840717 0.797554 1.193730 4.361885
variance 0.034584 0.002446 0.029867 0.039150
Quantiles for each variable:
2.5% 25% 50% 75% 97.5%
intercept -1145.639542 -989.823076 -913.067376 -837.985390 -693.822181
transaction date 0.104168 0.144264 0.165367 0.187846 0.229589
house age -0.007658 -0.006587 -0.006004 -0.005448 -0.004454
log MRT station distance -0.194091 -0.175519 -0.166026 -0.156556 -0.138945
stores number 0.005802 0.011556 0.014616 0.017608 0.023622
latitude 7.933698 9.032735 9.583895 10.176279 11.219930
longitude 1.262242 2.303130 2.840166 3.366287 4.466213
variance 0.030052 0.032863 0.034565 0.036164 0.039478
The summary reports a statistical evidence for:
positive effect of transaction date: \(1\) month increase would result in \(e^{\frac{0.165923}{12}} - 1 = 1.4\%\) percent increase in house price
negative effect of house age: \(1\) year increase would result in \(e^{-0.006024} - 1 = -0.6\%\) percent decrease in house price
negative effect of log MRT station distance: \(10\%\) percent increase in MRT station distance would result in \(1.10^{-0.166081} - 1 = -1.57\%\) percent decrease in house price
positive effect of stores number: \(1\) store increase would result in \(e^{0.014637} - 1 = 1.47\%\) percent increase in house price
positive effect of latitute: \(1'\) increase would result in \(e^{\frac{9.593946}{60}} - 1 = 17.3\%\) percent increase house price
positive effect of longitude: \(1'\) increase would result in \(e^{\frac{2.840717}{60}} - 1 = 4.85\%\) percent increase in house price
The combined effect of latitude and longitude suggest that the north-east of New Taipei City is the most expensive area, while the south-west is the most cheap area.