House Prices: Advanced Regression Techniques
Predict sales prices and practice feature engineering, RFs, and gradient boosting
- Description
- Evaluation
- Let's get started!
Start here if...
You have some experience with R or Python and machine learning basics. This is a perfect competition for data science students who have completed an online course in machine learning and are looking to expand their skill set before trying a featured competition.
Competition Description
Ask a home buyer to describe their dream house, and they probably won't begin with the height of the basement ceiling or the proximity to an east-west railroad. But this playground competition's dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence.
With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition challenges you to predict the final price of each home.
Practice Skills
- Creative feature engineering
- Advanced regression techniques like random forest and gradient boosting
Acknowledgments
The Ames Housing dataset was compiled by Dean De Cock for use in data science education. It's an incredible alternative for data scientists looking for a modernized and expanded version of the often cited Boston Housing dataset.
Goal
It is your job to predict the sales price for each house. For each Id in the test set, you must predict the value of the SalePrice variable.
Metric
Submissions are evaluated on Root-Mean-Squared-Error (RMSE) between the logarithm of the predicted value and the logarithm of the observed sales price. (Taking logs means that errors in predicting expensive houses and cheap houses will affect the result equally.)
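For reference, here is a minimal sketch of this metric; the names y_true and y_pred are just placeholders for arrays of observed and predicted sale prices (they are not defined elsewhere in this notebook):
import numpy as np

def rmsle(y_true, y_pred):
    # RMSE computed on the logarithm of the prices, as described above
    return np.sqrt(np.mean((np.log(np.asarray(y_pred)) - np.log(np.asarray(y_true))) ** 2))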
Submission File Format
The file should contain a header and have the following format:
Id,SalePrice
1461,169000.1
1462,187724.1233
1463,175221
etc.
You can download an example submission file (sample_submission.csv) on the Data page.
## load the library
import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt
%matplotlib inline
import missingno as msno
import scipy.stats as stats
url_train = "https://raw.githubusercontent.com/lucastiagooliveira/lucastiagooliveira/master/Kaggle/house-prices-advanced-regression-techniques/train.csv"
url_test = "https://raw.githubusercontent.com/lucastiagooliveira/lucastiagooliveira/master/Kaggle/house-prices-advanced-regression-techniques/test.csv"
df_train = pd.read_csv(url_train)
df_test = pd.read_csv(url_test)
combine = [df_train, df_test]
df_train.head()
df_test.head()
df_train.info()
print('_'*50)
df_test.info()
sns.set()
sns.distplot(df_train.SalePrice, color = 'b')
print('Skewness: %f' % df_train.SalePrice.skew())
print('Kurtosis: %f' % df_train.SalePrice.kurt())
quant = [i for i in df_train.columns if df_train[i].dtypes != object]
quali = [i for i in df_train.columns if df_train[i].dtypes == object]
quant.remove('Id')
quant.remove('SalePrice')
# quant = df_train[quant]
# quali = df_train[quali]
target = df_train.SalePrice
# Plot the quantitative variables
sns.set(style="darkgrid")
melted = pd.melt(df_train, value_vars= quant)
g = sns.FacetGrid(melted, col = 'variable', margin_titles=True, col_wrap = 3, sharex = False, sharey = False, height = 5)
g.map(sns.distplot, "value", color="steelblue")
# Plot the qualitative variables
def boxplot(x, y, **kwargs):
    sns.boxplot(x=x, y=y)
    x = plt.xticks(rotation = 90)
sns.set()
melted = pd.melt(df_train, value_vars= quali, id_vars = ['SalePrice'])
g = sns.FacetGrid(melted, col = 'variable', margin_titles=True, col_wrap = 2, sharex = False, sharey = False, height = 8)
g.map(boxplot, 'value', 'SalePrice')
# sns.heatmap(df_train.corr())
sns.barplot(x = df_train.corr().SalePrice.sort_values(ascending = False).index ,y = df_train.corr().SalePrice.sort_values(ascending = False))
plt.xticks(rotation = 90)
# The variables most positively correlated with SalePrice in the dataset
sns.heatmap(df_train[df_train.corr().SalePrice.sort_values(ascending = False).index[0:10]].corr(),
            annot = True,
            linewidths=.3)
Some observations about the resulting correlation heatmap:
- 'OverallCond': sounds right; the better the overall condition of the house, the higher the price tends to be;
- 'GrLivArea': above-grade living area, which makes sense;
- 'GarageCars' and 'GarageArea': these look like twins, with a correlation of about 0.88, so we'll keep just one of them for the analysis (a quick correlation check of these pairs is sketched right after this list);
- 'TotalBsmtSF': total square feet of basement area, which also makes sense;
- '1stFlrSF': first-floor square feet, reasonable, but it largely overlaps with 'TotalBsmtSF';
- 'FullBath': useful;
- 'TotRmsAbvGrd': a twin of 'GrLivArea';
- 'YearBuilt': good, but only slightly correlated with price.
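To double-check the "twin" pairs mentioned above, a quick look at their pairwise correlations on the already loaded df_train is enough:
# Sanity check of the "twin" pairs noted above
df_train[['GarageCars', 'GarageArea']].corr()
df_train[['TotRmsAbvGrd', 'GrLivArea']].corr()
df_train[['TotalBsmtSF', '1stFlrSF']].corr()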
var_used = ['SalePrice', 'OverallCond', 'GrLivArea','GarageCars', 'TotalBsmtSF','FullBath','YearBuilt']
sns.pairplot(df_train[var_used], height = 7)
train = df_train[var_used]
var_used_test = var_used.copy()
var_used_test.remove('SalePrice')
test = df_test[var_used_test]
test
drop_out = list(train.loc[train.GrLivArea > 4500].index)
train = train.drop(labels = drop_out, axis = 0)
sns.set()
ax = sns.scatterplot(x = 'GrLivArea', y = 'SalePrice', data = train)
sns.distplot(train.SalePrice)
print('Skewness of the SalePrice %f' % stats.skew(train.SalePrice))
train.SalePrice = np.log1p(train.SalePrice)
sns.distplot(train.SalePrice)
print('Skewness of the SalePrice %f' % stats.skew(train.SalePrice))
skewness = train.apply(lambda x:stats.skew(x))
skewness
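GrLivArea is still noticeably right-skewed, so it gets the same log treatment below. A more general, hedged sketch (illustration only, not part of this pipeline; the 0.75 cut-off is an arbitrary choice) would transform every feature whose skewness exceeds a chosen threshold:
# Sketch: log1p-transform every numeric feature with skewness above a threshold (illustration only)
train_sketch = train.copy()
skewed_cols = skewness[skewness.abs() > 0.75].index.drop('SalePrice', errors='ignore')
train_sketch[skewed_cols] = np.log1p(train_sketch[skewed_cols])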
sns.distplot(train.GrLivArea)
# Apply the log transformation to GrLivArea
train.GrLivArea = train.GrLivArea.apply(lambda x: np.log1p(x))
test.GrLivArea = test.GrLivArea.apply(lambda x: np.log1p(x))
sns.distplot(train.GrLivArea)
from sklearn.preprocessing import StandardScaler
X = train.drop('SalePrice', axis = 1)
y = train.SalePrice
# X = pd.get_dummies(X, columns = ['OverallCond', 'GarageCars', 'FullBath'])
X.reset_index(inplace = True, drop = True)
scaler = StandardScaler()
X
def diff(li1, li2):
    # Element-wise absolute difference between two lists, returned as a tuple
    li_dif = [abs(i - j) for i, j in zip(li1, li2)]
    return tuple(li_dif)

def get_dummy(data, col, dim):
    # One-hot encode each column in `col`, padding with zero columns so that every
    # encoded variable ends up with dim[i] levels (keeps train/test shapes aligned)
    dummy = pd.DataFrame()
    for i, column in enumerate(col):
        dummy = pd.get_dummies(data[column])
        # dummy = dummy.merge(pd.get_dummies(X[column]), how = 'inner', left_index= True, right_index=True)
        if pd.get_dummies(data[column]).shape[1] < dim[i]:
            lack = diff([0, dim[i]], list(pd.get_dummies(data[column]).shape))
            zeros = pd.DataFrame(np.zeros(lack))
            dummy = pd.concat([dummy, zeros], axis = 1, join = 'inner', ignore_index = True)
        data = pd.concat([data, dummy], axis = 1)
        data = data.drop(column, axis = 1)
        # data = data.merge(dummy, how = 'inner', left_index= True, right_index=True)
    return data
X = get_dummy(X, ['OverallCond', 'GarageCars', 'FullBath'], [9, 6, 5])
X.shape
X.columns
X = scaler.fit_transform(X)
X.shape
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3)
from sklearn.ensemble import GradientBoostingRegressor
reg = GradientBoostingRegressor(random_state=0)
reg.fit(X_train, y_train)
reg.score(X_test, y_test)
from sklearn.metrics import r2_score
yhat = reg.predict(X_test)
print("Mean absolute error: %.4f" % np.mean(np.absolute(yhat - y_test)))
print("Residual sum of squares (MSE): %.4f" % np.mean((yhat - y_test) ** 2))
print("R2-score: %.4f" % r2_score(yhat , y_test) )
sns.scatterplot(x = np.expm1(yhat), y = np.expm1(y_test))
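The score from a single 70/30 split can be noisy. A hedged sketch of a 5-fold cross-validated RMSE on the log target (the 'neg_root_mean_squared_error' scorer assumes a reasonably recent scikit-learn):
from sklearn.model_selection import cross_val_score
# 5-fold CV RMSE for the same gradient boosting model on the full training matrix
cv_rmse = -cross_val_score(GradientBoostingRegressor(random_state=0), X, y,
                           cv=5, scoring='neg_root_mean_squared_error')
print("CV RMSE: %.4f (+/- %.4f)" % (cv_rmse.mean(), cv_rmse.std()))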
from sklearn.linear_model import Ridge
clf = Ridge(alpha = 8)
clf.fit(X_train, y_train)
clf.score(X_test, y_test)
yhat = clf.predict(X_test)
print("Mean absolute error: %.4f" % np.mean(np.absolute(yhat - y_test)))
print("Residual sum of squares (MSE): %.4f" % np.mean((yhat - y_test) ** 2))
print("R2-score: %.4f" % r2_score(yhat , y_test) )
from sklearn.linear_model import Lasso
lasso = Lasso(alpha = 0.000002, max_iter = 10000).fit(X_train, y_train)
lasso.score(X_test, y_test)
yhat = lasso.predict(X_test)
print("Mean absolute error: %.4f" % np.mean(np.absolute(yhat - y_test)))
print("Residual sum of squares (MSE): %.4f" % np.mean((yhat - y_test) ** 2))
print("R2-score: %.4f" % r2_score(yhat , y_test) )
print(train.isnull().sum())
print('_'*20)
print(test.isnull().sum())
sns.distplot(test.GarageCars)
sns.distplot(test.TotalBsmtSF)
test.fillna(value = 0, inplace = True)
print(test.isnull().sum())
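Filling with 0 is the simplest option. A slightly more careful, hedged alternative (shown for comparison only, since the zero-fill above has already run) would impute GarageCars with its most frequent value and TotalBsmtSF with its median:
# Alternative imputation, illustration only:
# test['GarageCars'] = test['GarageCars'].fillna(test['GarageCars'].mode()[0])
# test['TotalBsmtSF'] = test['TotalBsmtSF'].fillna(test['TotalBsmtSF'].median())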
train.groupby(by = 'GarageCars').count()
test.groupby(by = 'GarageCars').count()
train.groupby(by = 'FullBath').count()
test.groupby(by = 'FullBath').count()
# x_test = get_dummy(test, ['OverallCond', 'GarageCars', 'FullBath'], [9, 6, 5])
x_test_2 = pd.get_dummies(test, columns = ['OverallCond', 'GarageCars', 'FullBath'])
test
x_test_2
x_test_2 = scaler.fit_transform(x_test_2)
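Note that fit_transform here refits the scaler on the test features. The usual leakage-free pattern, sketched below with hypothetical train_matrix/test_matrix names, fits the scaler on the training features once and only transforms the test set; it assumes both frames share exactly the same columns in the same order:
# Leakage-free scaling pattern (hypothetical names, illustration only):
# scaler = StandardScaler().fit(train_matrix)
# train_scaled = scaler.transform(train_matrix)
# test_scaled = scaler.transform(test_matrix)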
from sklearn.linear_model import Lasso
lasso_2 = Lasso(alpha = 0.000002, max_iter = 10000).fit(X, y)
lasso_2.score(X, y)
yhat_final = lasso_2.predict(x_test_2)
sns.distplot(yhat_final)
dicto = {'Id': list(df_test.Id), 'SalePrice': np.expm1(yhat_final).tolist()}
len(dicto['Id'])
submission = pd.DataFrame(dicto)
submission
submission.to_csv('submission.csv',index=False)
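Before uploading, a quick sanity check that the written file matches the expected format:
# Re-read the file and confirm shape, columns, and that no predictions are missing
check = pd.read_csv('submission.csv')
print(check.shape)              # expected (1459, 2) for this competition's test set
print(check.columns.tolist())   # expected ['Id', 'SalePrice']
print(check.isnull().sum())     # expected: no missing values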