LinearModel¶

class LinearModel¶

Bases: Model

LinearModel object.

Attributes¶

data: pandas.DataFrame: Data for the linear regression model. A pandas.DataFrame containing all regressor variables \(X\) and the response variable \(y\).
response_variable: str: Response variable \(y\) of the linear model.
priors: dict: Priors for the regressors’ and variance parameters.
variable_names: list: The list of all model variables: the regressors \(X\), including the 'intercept' and the 'variance' \(\sigma^2\).
posteriors: dict: Posterior samples, stored as key-value pairs of posterior name and sample. Each sample is a numpy.ndarray with a number of rows equal to the number of iterations and a number of columns equal to the number of Markov chains.

Methods¶

posteriors_to_frame(): It organizes the posteriors in a pandas.DataFrame.
residuals(): It computes the residuals \(\epsilon\) with respect to predicted values \(\hat{y}\).
predict_distribution(): It predicts a posterior distribution for unobserved values.
likelihood(): It computes the likelihood of observations response_variable given a model 'mean' and 'variance'.
log_likelihood(): It computes the log likelihood of observations response_variable given a model 'mean' and 'variance'.

property data: DataFrame¶

Data for the linear regression model. A pandas.DataFrame containing all regressor variables \(X\) and the response variable \(y\).

Returns¶

pandas.DataFrame: Observed data of the model. It cannot be empty. It must contain regressor variables \(X\) and the response_variable \(y\).

Raises

TypeError: If data is not an instance of pandas.DataFrame.
ValueError: If data is an empty pandas.DataFrame.

likelihood(data: DataFrame) → ndarray¶

It computes the likelihood of observations response_variable given a model 'mean' and 'variance'.

Parameters¶

data: pandas.DataFrame: Data to use for likelihood computation. It cannot be empty. It must contain columns response_variable, 'mean' and 'variance'.

Returns¶

numpy.ndarray: Array of computed likelihoods. It has the same length as data. Each element is the likelihood of the corresponding row of data.

Raises

TypeError

If data is not an instance of pandas.DataFrame.

ValueError

If data is an empty pandas.DataFrame,
if response_variable is not a column of data,
if 'mean' is not a column of data,
if 'variance' is not a column of data.

Notes

The likelihood is computed with the normal distribution probability density function:

\[L(y) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp\left(- \frac{\left(y - \mu \right)^2}{2 \sigma^2}\right)\]

where \(\mu\) is the 'mean' column and \(\sigma^2\) is the 'variance' column.
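As an illustration of the formula above, the row-wise computation can be sketched with NumPy. This is a minimal standalone sketch, not the library's implementation: the function name normal_likelihood, its response_variable argument and the column name 'y' are assumptions made for the example.

```python
import numpy as np
import pandas as pd


def normal_likelihood(data: pd.DataFrame, response_variable: str) -> np.ndarray:
    """Hypothetical helper: normal density of each row's response value."""
    y = data[response_variable].to_numpy(dtype=float)
    mu = data['mean'].to_numpy(dtype=float)
    sigma2 = data['variance'].to_numpy(dtype=float)
    # Normal probability density function, evaluated element-wise
    return np.exp(-(y - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)


frame = pd.DataFrame({'y': [0.0, 1.0], 'mean': [0.0, 0.0], 'variance': [1.0, 1.0]})
values = normal_likelihood(frame, response_variable='y')
```

The result has one element per row of the input frame, matching the return value described above.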

log_likelihood(data: DataFrame) → ndarray¶

It computes the log likelihood of observations response_variable given a model 'mean' and 'variance'.

Parameters¶

data: pandas.DataFrame: Data to use for log likelihood computation. It cannot be empty. It must contain columns response_variable, 'mean' and 'variance'.

Returns¶

numpy.ndarray: Array of computed log likelihoods. It has the same length as data. Each element is the log likelihood of the corresponding row of data.

Raises

TypeError

If data is not an instance of pandas.DataFrame.

ValueError

If data is an empty pandas.DataFrame,
if response_variable is not a column of data,
if 'mean' is not a column of data,
if 'variance' is not a column of data.

Notes

The log likelihood is computed as the log of the normal distribution probability density function:

\[l(y) = - \frac{1}{2} \log\left(2 \pi \sigma^2\right) - \frac{1}{2} \frac{\left(y - \mu \right)^2}{\sigma^2}\]

where \(\mu\) is the 'mean' column and \(\sigma^2\) is the 'variance' column.
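The formula above can be sketched with NumPy as follows. This is a standalone sketch under stated assumptions, not the library's code: the function name normal_log_likelihood, its response_variable argument and the column name 'y' are hypothetical.

```python
import numpy as np
import pandas as pd


def normal_log_likelihood(data: pd.DataFrame, response_variable: str) -> np.ndarray:
    """Hypothetical helper: log of the normal density of each row's response value."""
    y = data[response_variable].to_numpy(dtype=float)
    mu = data['mean'].to_numpy(dtype=float)
    sigma2 = data['variance'].to_numpy(dtype=float)
    # Log of the normal probability density function, evaluated element-wise
    return -0.5 * np.log(2 * np.pi * sigma2) - 0.5 * (y - mu) ** 2 / sigma2


frame = pd.DataFrame({'y': [0.0, 2.0], 'mean': [0.0, 1.0], 'variance': [1.0, 4.0]})
log_values = normal_log_likelihood(frame, response_variable='y')
# Summing the row-wise values gives the joint log likelihood of independent observations
total_log_likelihood = log_values.sum()
```

Working on the log scale turns the product of per-row likelihoods into a sum, which avoids numerical underflow when there are many observations.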

property posteriors: dict[str, ndarray]¶

Posteriors of the regressors’ and variance parameters, stored as key-value pairs of posterior name and sample. Each sample is a numpy.ndarray with a number of rows equal to the number of iterations and a number of columns equal to the number of Markov chains.

Returns¶

dict: Posterior samples, stored as key-value pairs of posterior name and sample. Each sample is a numpy.ndarray with a number of rows equal to the number of iterations and a number of columns equal to the number of Markov chains.

Raises

TypeError

If posteriors is not a dict,
if a posterior sample is not a numpy.ndarray.

KeyError

If posteriors does not contain both 'intercept' and 'variance' keys.

ValueError

If a posterior sample is an empty numpy.ndarray.

posteriors_to_frame() → DataFrame¶

It organizes the posteriors in a pandas.DataFrame. Each posterior is a frame column. The length of the frame is the number of sampling iterations times the number of sampling chains.

Returns¶

pandas.DataFrame: Posterior samples, organized with one posterior per column. The length of the frame is the number of sampling iterations times the number of sampling chains.

Raises

ValueError: If posteriors are not available because the method LinearRegression.sample has not been called yet.
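The reshaping described above can be sketched with NumPy and pandas. This is an illustration only: the posteriors dict is filled with synthetic values, and the order in which chains are flattened into rows is an assumption, since the source does not specify it.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_iterations, n_chains = 4, 2

# Synthetic posteriors: one (iterations x chains) array per variable
posteriors = {
    'intercept': rng.normal(size=(n_iterations, n_chains)),
    'x_1': rng.normal(size=(n_iterations, n_chains)),
    'variance': rng.gamma(shape=2.0, size=(n_iterations, n_chains)),
}

# Flatten each sample to one dimension and use it as a frame column
# (row-major flattening is an assumption of this sketch)
frame = pd.DataFrame({name: sample.flatten() for name, sample in posteriors.items()})
```

With 4 iterations and 2 chains, the resulting frame has 8 rows and one column per posterior.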

predict_distribution(predictors: dict[str, float | int]) → ndarray¶

It predicts a posterior distribution for unobserved values. For each posterior sample, it draws a sample from the likelihood.

Parameters¶

predictors: dict: Values of the predictors \(X\) at which to compute the posterior distribution. Each predictor has to be set as a key-value pair.

Returns¶

numpy.ndarray: Array of the predicted posterior distribution. It contains a number of elements equal to the number of regression iterations times the number of model Markov chains.

Raises

TypeError: If predictors is not a dict.
KeyError: If a predictors key is not a key of posteriors.
ValueError: If predictors is an empty dict.
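The idea of drawing one value from the likelihood for each posterior sample can be sketched as follows. This is a standalone sketch, not the library's implementation: the function name predict_distribution_sketch and its seed argument are hypothetical, and the posteriors are assumed to be a dict of (iterations × chains) arrays with 'intercept' and 'variance' keys, as described for the posteriors property.

```python
import numpy as np


def predict_distribution_sketch(posteriors, predictors, seed=0):
    """Hypothetical helper: one likelihood draw per posterior sample."""
    rng = np.random.default_rng(seed)
    # Linear predictor for every posterior sample: intercept + sum_j beta_j * x_j
    mean = posteriors['intercept'].flatten()
    for name, value in predictors.items():
        # A missing key raises KeyError, as in the documented behavior
        mean = mean + posteriors[name].flatten() * value
    std = np.sqrt(posteriors['variance'].flatten())
    # Draw y ~ N(mean, variance) once per posterior sample
    return rng.normal(loc=mean, scale=std)


# Synthetic posteriors with 5 iterations and 2 chains
example_posteriors = {
    'intercept': np.full((5, 2), 1.0),
    'x_1': np.full((5, 2), 2.0),
    'variance': np.full((5, 2), 0.25),
}
prediction = predict_distribution_sketch(example_posteriors, {'x_1': 3.0})
```

The returned array has one element per posterior sample, here 5 × 2 = 10, scattered around the predicted mean 1 + 2 · 3 = 7.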

property priors: dict¶

Priors for the regressors’ and variance parameters.

Returns¶

dict: Priors for each random variable. It must contain both 'intercept' and 'variance' keys. Each value must be a dict with hyperparameter names as keys and hyperparameter values as values.

Raises

TypeError

If priors is not a dict,
if a priors’ value is not a dict.

ValueError

If priors is an empty dict,
if a priors’ value is an empty dict,
if a 'variance' value is not positive,
if a 'shape' value is not positive,
if a 'scale' value is not positive.

KeyError

If priors does not contain both 'intercept' and 'variance' keys,
if a prior’s hyperparameters are not:
- 'mean' and 'variance' for a regression parameter \(\beta_j\) or
- 'shape' and 'scale' for variance \(\sigma^2\).

Notes

Each random variable is assigned a prior distribution:

each regressor parameter \(\beta_j\) is assigned a normal prior distribution with hyperparameters 'mean' \(\beta_j^0\) and 'variance' \(\Sigma_{\beta_j}^0\):

\[\beta_j \sim N(\beta_j^0 , \Sigma_{\beta_j}^0)\]

the variance \(\sigma^2\) is assigned an inverse gamma prior distribution with hyperparameters 'shape' \(\kappa^0\) and 'scale' \(\theta^0\):

\[\sigma^2 \sim \text{Inv-}\Gamma(\kappa^0, \theta^0)\]

Examples

Consider a linear regression of the response_variable \(y\) with respect to regressors \(x_1\), \(x_2\) and \(x_3\), according to the following model:

\[y \sim N(\mu, \sigma^2)\]

\[\mu = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3\]

then the sampler would require priors for:

parameter \(\beta_0\) of variable 'intercept', with 'mean' \(\beta_0^0\) and 'variance' \(\Sigma_{\beta_0}^0\)
parameter \(\beta_1\) of variable \(x_1\), with 'mean' \(\beta_1^0\) and 'variance' \(\Sigma_{\beta_1}^0\)
parameter \(\beta_2\) of variable \(x_2\), with 'mean' \(\beta_2^0\) and 'variance' \(\Sigma_{\beta_2}^0\)
parameter \(\beta_3\) of variable \(x_3\), with 'mean' \(\beta_3^0\) and 'variance' \(\Sigma_{\beta_3}^0\)
variable \(\sigma^2\), with 'shape' \(\kappa^0\) and 'scale' \(\theta^0\)

>>> import baypy
>>> model = baypy.model.LinearModel()
>>> model.priors = {
...     'intercept': {'mean': 0, 'variance': 1e6},
...     'x_1': {'mean': 0, 'variance': 1e6},
...     'x_2': {'mean': 0, 'variance': 1e6},
...     'x_3': {'mean': 0, 'variance': 1e6},
...     'variance': {'shape': 1, 'scale': 1e-6}
... }
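The conditions listed under Raises for this property can be sketched as a standalone check. This is an illustration only: the function name check_priors is hypothetical, and the order in which the checks run, as well as the error messages, are assumptions rather than the library's behavior.

```python
def check_priors(priors):
    """Hypothetical validator mirroring the documented Raises conditions."""
    if not isinstance(priors, dict):
        raise TypeError('priors must be a dict')
    if len(priors) == 0:
        raise ValueError('priors cannot be an empty dict')
    for name in ('intercept', 'variance'):
        if name not in priors:
            raise KeyError(f"priors must contain the key '{name}'")
    for name, hyperparameters in priors.items():
        if not isinstance(hyperparameters, dict):
            raise TypeError(f"the prior of '{name}' must be a dict")
        if len(hyperparameters) == 0:
            raise ValueError(f"the prior of '{name}' cannot be an empty dict")
        # 'shape' and 'scale' for the variance, 'mean' and 'variance' otherwise
        expected = {'shape', 'scale'} if name == 'variance' else {'mean', 'variance'}
        if set(hyperparameters) != expected:
            raise KeyError(f"the prior of '{name}' must have keys {sorted(expected)}")
        # Every hyperparameter except 'mean' must be strictly positive
        for key in expected - {'mean'}:
            if hyperparameters[key] <= 0:
                raise ValueError(f"'{key}' of '{name}' must be positive")


example_priors = {
    'intercept': {'mean': 0, 'variance': 1e6},
    'x_1': {'mean': 0, 'variance': 1e6},
    'variance': {'shape': 1, 'scale': 1e-6},
}
check_priors(example_priors)  # passes silently
```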

residuals() → DataFrame¶

It computes the residuals \(\epsilon\) with respect to predicted values \(\hat{y}\).

Returns¶

pandas.DataFrame: Returns a copy of data with 3 more columns: 'intercept', 'predicted' and 'residuals'.

Raises

ValueError

If data is None because the property data has not been set,
if response_variable is not a column of data,
if posteriors is None because the sampling has not been done yet.

Notes

Predicted values are computed at data points \(X\) using the posterior mean of each regressor’s parameter:

\[\hat{y_i} = \beta_0 + \sum_{j = 1}^{m} \beta_j x_{i,j}\]

while residuals are the difference between the observed values and the predicted values of the response_variable:

\[\epsilon_i = y_i - \hat{y_i}\]
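The two formulas above can be sketched with pandas on synthetic values. This is a minimal sketch, not the library's implementation: the column names 'x_1' and 'y' are made up for the example, and filling the 'intercept' column with ones is an assumption about what that column holds.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({'x_1': [0.0, 1.0, 2.0], 'y': [1.0, 3.5, 4.5]})

# Synthetic posteriors with 4 iterations and 2 chains
posteriors = {
    'intercept': np.full((4, 2), 1.0),
    'x_1': np.full((4, 2), 2.0),
    'variance': np.full((4, 2), 0.5),
}

result = data.copy()
result['intercept'] = 1.0  # assumption: a column of ones multiplying beta_0
regressors = [name for name in posteriors if name != 'variance']
# Predicted value: sum over regressors of posterior mean times regressor value
result['predicted'] = sum(result[name] * posteriors[name].mean() for name in regressors)
# Residual: observed minus predicted response
result['residuals'] = result['y'] - result['predicted']
```

Here the posterior means are 1 for the intercept and 2 for x_1, so the predicted values are 1, 3 and 5 and the residuals are 0, 0.5 and -0.5.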

property response_variable: str¶

Response variable \(y\) of the linear model.

Returns¶

str: Name of the response variable \(y\). It must be one of the columns of data.

Raises

TypeError: If response_variable is not a str.

property variable_names: list[str]¶

Variables of the linear model.

Returns¶

list: The list of all model variables: the regressors \(X\), including the 'intercept' and the 'variance' \(\sigma^2\).