Description

Start here if...

You have some experience with R or Python and machine learning basics. This is a perfect competition for data science students who have completed an online course in machine learning and are looking to expand their skill set before trying a featured competition.

Competition Description

Ask a home buyer to describe their dream house, and they probably won't begin with the height of the basement ceiling or the proximity to an east-west railroad. But this playground competition's dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence.

With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition challenges you to predict the final price of each home.

Practice Skills

- Creative feature engineering
- Advanced regression techniques like random forest and gradient boosting

Acknowledgments

The Ames Housing dataset was compiled by Dean De Cock for use in data science education. It's an incredible alternative for data scientists looking for a modernized and expanded version of the often cited Boston Housing dataset.

Evaluation

Goal

It is your job to predict the sales price for each house. For each Id in the test set, you must predict the value of the SalePrice variable.

Metric

Submissions are evaluated on Root-Mean-Squared-Error (RMSE) between the logarithm of the predicted value and the logarithm of the observed sales price. (Taking logs means that errors in predicting expensive houses and cheap houses will affect the result equally.)
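The metric can be sketched in a few lines of NumPy. This is only an illustration of the description above, not the competition's scoring code; the function name `rmse_log` and the example prices are hypothetical.

```python
import numpy as np

def rmse_log(y_true, y_pred):
    """RMSE between the logs of the observed and the predicted prices."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((np.log(y_pred) - np.log(y_true)) ** 2)))

# Hypothetical prices: both predictions are 10% too high
cheap = rmse_log([100_000], [110_000])
expensive = rmse_log([500_000], [550_000])
print(cheap, expensive)
```

Because the error is measured on the log scale, the same relative miss on a cheap and on an expensive house contributes the same amount to the score.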

Submission File Format

The file should contain a header and have the following format:

Id,SalePrice
1461,169000.1
1462,187724.1233
1463,175221
etc.

You can download an example submission file (sample_submission.csv) on the Data page.

Let's get started!

Load the libraries

import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt
%matplotlib inline
import missingno as msno

import scipy.stats as stats

Load the datasets

url_train = "https://raw.githubusercontent.com/lucastiagooliveira/lucastiagooliveira/master/Kaggle/house-prices-advanced-regression-techniques/train.csv"
url_test = "https://raw.githubusercontent.com/lucastiagooliveira/lucastiagooliveira/master/Kaggle/house-prices-advanced-regression-techniques/test.csv"
df_train = pd.read_csv(url_train)
df_test = pd.read_csv(url_test)
combine = [df_train, df_test]
df_train.head()
Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities ... PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition SalePrice
0 1 60 RL 65.0 8450 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 2 2008 WD Normal 208500
1 2 20 RL 80.0 9600 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 5 2007 WD Normal 181500
2 3 60 RL 68.0 11250 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 9 2008 WD Normal 223500
3 4 70 RL 60.0 9550 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 2 2006 WD Abnorml 140000
4 5 60 RL 84.0 14260 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 12 2008 WD Normal 250000

5 rows × 81 columns

df_test.head()
Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities ... ScreenPorch PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition
0 1461 20 RH 80.0 11622 Pave NaN Reg Lvl AllPub ... 120 0 NaN MnPrv NaN 0 6 2010 WD Normal
1 1462 20 RL 81.0 14267 Pave NaN IR1 Lvl AllPub ... 0 0 NaN NaN Gar2 12500 6 2010 WD Normal
2 1463 60 RL 74.0 13830 Pave NaN IR1 Lvl AllPub ... 0 0 NaN MnPrv NaN 0 3 2010 WD Normal
3 1464 60 RL 78.0 9978 Pave NaN IR1 Lvl AllPub ... 0 0 NaN NaN NaN 0 6 2010 WD Normal
4 1465 120 RL 43.0 5005 Pave NaN IR1 HLS AllPub ... 144 0 NaN NaN NaN 0 1 2010 WD Normal

5 rows × 80 columns

print(df_train.head().info())
print('_'*50)
print(df_test.head().info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 81 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             5 non-null      int64  
 1   MSSubClass     5 non-null      int64  
 2   MSZoning       5 non-null      object 
 3   LotFrontage    5 non-null      float64
 4   LotArea        5 non-null      int64  
 5   Street         5 non-null      object 
 6   Alley          0 non-null      object 
 7   LotShape       5 non-null      object 
 8   LandContour    5 non-null      object 
 9   Utilities      5 non-null      object 
 10  LotConfig      5 non-null      object 
 11  LandSlope      5 non-null      object 
 12  Neighborhood   5 non-null      object 
 13  Condition1     5 non-null      object 
 14  Condition2     5 non-null      object 
 15  BldgType       5 non-null      object 
 16  HouseStyle     5 non-null      object 
 17  OverallQual    5 non-null      int64  
 18  OverallCond    5 non-null      int64  
 19  YearBuilt      5 non-null      int64  
 20  YearRemodAdd   5 non-null      int64  
 21  RoofStyle      5 non-null      object 
 22  RoofMatl       5 non-null      object 
 23  Exterior1st    5 non-null      object 
 24  Exterior2nd    5 non-null      object 
 25  MasVnrType     5 non-null      object 
 26  MasVnrArea     5 non-null      float64
 27  ExterQual      5 non-null      object 
 28  ExterCond      5 non-null      object 
 29  Foundation     5 non-null      object 
 30  BsmtQual       5 non-null      object 
 31  BsmtCond       5 non-null      object 
 32  BsmtExposure   5 non-null      object 
 33  BsmtFinType1   5 non-null      object 
 34  BsmtFinSF1     5 non-null      int64  
 35  BsmtFinType2   5 non-null      object 
 36  BsmtFinSF2     5 non-null      int64  
 37  BsmtUnfSF      5 non-null      int64  
 38  TotalBsmtSF    5 non-null      int64  
 39  Heating        5 non-null      object 
 40  HeatingQC      5 non-null      object 
 41  CentralAir     5 non-null      object 
 42  Electrical     5 non-null      object 
 43  1stFlrSF       5 non-null      int64  
 44  2ndFlrSF       5 non-null      int64  
 45  LowQualFinSF   5 non-null      int64  
 46  GrLivArea      5 non-null      int64  
 47  BsmtFullBath   5 non-null      int64  
 48  BsmtHalfBath   5 non-null      int64  
 49  FullBath       5 non-null      int64  
 50  HalfBath       5 non-null      int64  
 51  BedroomAbvGr   5 non-null      int64  
 52  KitchenAbvGr   5 non-null      int64  
 53  KitchenQual    5 non-null      object 
 54  TotRmsAbvGrd   5 non-null      int64  
 55  Functional     5 non-null      object 
 56  Fireplaces     5 non-null      int64  
 57  FireplaceQu    4 non-null      object 
 58  GarageType     5 non-null      object 
 59  GarageYrBlt    5 non-null      float64
 60  GarageFinish   5 non-null      object 
 61  GarageCars     5 non-null      int64  
 62  GarageArea     5 non-null      int64  
 63  GarageQual     5 non-null      object 
 64  GarageCond     5 non-null      object 
 65  PavedDrive     5 non-null      object 
 66  WoodDeckSF     5 non-null      int64  
 67  OpenPorchSF    5 non-null      int64  
 68  EnclosedPorch  5 non-null      int64  
 69  3SsnPorch      5 non-null      int64  
 70  ScreenPorch    5 non-null      int64  
 71  PoolArea       5 non-null      int64  
 72  PoolQC         0 non-null      object 
 73  Fence          0 non-null      object 
 74  MiscFeature    0 non-null      object 
 75  MiscVal        5 non-null      int64  
 76  MoSold         5 non-null      int64  
 77  YrSold         5 non-null      int64  
 78  SaleType       5 non-null      object 
 79  SaleCondition  5 non-null      object 
 80  SalePrice      5 non-null      int64  
dtypes: float64(3), int64(35), object(43)
memory usage: 3.3+ KB
None
__________________________________________________
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 80 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             5 non-null      int64  
 1   MSSubClass     5 non-null      int64  
 2   MSZoning       5 non-null      object 
 3   LotFrontage    5 non-null      float64
 4   LotArea        5 non-null      int64  
 5   Street         5 non-null      object 
 6   Alley          0 non-null      object 
 7   LotShape       5 non-null      object 
 8   LandContour    5 non-null      object 
 9   Utilities      5 non-null      object 
 10  LotConfig      5 non-null      object 
 11  LandSlope      5 non-null      object 
 12  Neighborhood   5 non-null      object 
 13  Condition1     5 non-null      object 
 14  Condition2     5 non-null      object 
 15  BldgType       5 non-null      object 
 16  HouseStyle     5 non-null      object 
 17  OverallQual    5 non-null      int64  
 18  OverallCond    5 non-null      int64  
 19  YearBuilt      5 non-null      int64  
 20  YearRemodAdd   5 non-null      int64  
 21  RoofStyle      5 non-null      object 
 22  RoofMatl       5 non-null      object 
 23  Exterior1st    5 non-null      object 
 24  Exterior2nd    5 non-null      object 
 25  MasVnrType     5 non-null      object 
 26  MasVnrArea     5 non-null      float64
 27  ExterQual      5 non-null      object 
 28  ExterCond      5 non-null      object 
 29  Foundation     5 non-null      object 
 30  BsmtQual       5 non-null      object 
 31  BsmtCond       5 non-null      object 
 32  BsmtExposure   5 non-null      object 
 33  BsmtFinType1   5 non-null      object 
 34  BsmtFinSF1     5 non-null      float64
 35  BsmtFinType2   5 non-null      object 
 36  BsmtFinSF2     5 non-null      float64
 37  BsmtUnfSF      5 non-null      float64
 38  TotalBsmtSF    5 non-null      float64
 39  Heating        5 non-null      object 
 40  HeatingQC      5 non-null      object 
 41  CentralAir     5 non-null      object 
 42  Electrical     5 non-null      object 
 43  1stFlrSF       5 non-null      int64  
 44  2ndFlrSF       5 non-null      int64  
 45  LowQualFinSF   5 non-null      int64  
 46  GrLivArea      5 non-null      int64  
 47  BsmtFullBath   5 non-null      float64
 48  BsmtHalfBath   5 non-null      float64
 49  FullBath       5 non-null      int64  
 50  HalfBath       5 non-null      int64  
 51  BedroomAbvGr   5 non-null      int64  
 52  KitchenAbvGr   5 non-null      int64  
 53  KitchenQual    5 non-null      object 
 54  TotRmsAbvGrd   5 non-null      int64  
 55  Functional     5 non-null      object 
 56  Fireplaces     5 non-null      int64  
 57  FireplaceQu    2 non-null      object 
 58  GarageType     5 non-null      object 
 59  GarageYrBlt    5 non-null      float64
 60  GarageFinish   5 non-null      object 
 61  GarageCars     5 non-null      float64
 62  GarageArea     5 non-null      float64
 63  GarageQual     5 non-null      object 
 64  GarageCond     5 non-null      object 
 65  PavedDrive     5 non-null      object 
 66  WoodDeckSF     5 non-null      int64  
 67  OpenPorchSF    5 non-null      int64  
 68  EnclosedPorch  5 non-null      int64  
 69  3SsnPorch      5 non-null      int64  
 70  ScreenPorch    5 non-null      int64  
 71  PoolArea       5 non-null      int64  
 72  PoolQC         0 non-null      object 
 73  Fence          2 non-null      object 
 74  MiscFeature    1 non-null      object 
 75  MiscVal        5 non-null      int64  
 76  MoSold         5 non-null      int64  
 77  YrSold         5 non-null      int64  
 78  SaleType       5 non-null      object 
 79  SaleCondition  5 non-null      object 
dtypes: float64(11), int64(26), object(43)
memory usage: 3.2+ KB
None
sns.set()

sns.distplot(df_train.SalePrice, color = 'b')
<AxesSubplot:xlabel='SalePrice'>
print('Skewness: %f' % df_train.SalePrice.skew())
print('Kurtosis: %f' % df_train.SalePrice.kurt())
Skewness: 1.882876
Kurtosis: 6.536282

Separating the variables by type: quantitative and qualitative

quant = [i for i in df_train.columns if df_train[i].dtypes != object]
quali = [i for i in df_train.columns if df_train[i].dtypes == object]

quant.remove('Id')
quant.remove('SalePrice')

# quant = df_train[quant]
# quali = df_train[quali]

target = df_train.SalePrice
# Plot the quantitative variables

sns.set(style="darkgrid")

melted = pd.melt(df_train, value_vars= quant)

g = sns.FacetGrid(melted, col = 'variable', margin_titles=True, col_wrap = 3, sharex = False, sharey = False, height = 5)

g.map(sns.distplot, "value", color="steelblue")
<seaborn.axisgrid.FacetGrid at 0x1a367b5a2e8>
# Plot the qualitative variables

def boxplot(x, y, **kwargs):
    sns.boxplot(x=x, y=y)
    x = plt.xticks(rotation = 90)
sns.set()

melted = pd.melt(df_train, value_vars= quali, id_vars = ['SalePrice'])

g = sns.FacetGrid(melted, col = 'variable', margin_titles=True, col_wrap = 2, sharex = False, sharey = False, height = 8)

g.map(boxplot, 'value', 'SalePrice')
<seaborn.axisgrid.FacetGrid at 0x1a36b712358>

Correlation with SalePrice

# sns.heatmap(df_train.corr())
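# Correlate the numeric columns only; DataFrame.corr() raises on object columns in recent pandas versions
corr_price = df_train.select_dtypes('number').corr().SalePrice.sort_values(ascending = False)
sns.barplot(x = corr_price.index, y = corr_price)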
plt.xticks(rotation = 90)
(array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
        17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
        34, 35, 36, 37]),
 [Text(0, 0, 'SalePrice'),
  Text(1, 0, 'OverallQual'),
  Text(2, 0, 'GrLivArea'),
  Text(3, 0, 'GarageCars'),
  Text(4, 0, 'GarageArea'),
  Text(5, 0, 'TotalBsmtSF'),
  Text(6, 0, '1stFlrSF'),
  Text(7, 0, 'FullBath'),
  Text(8, 0, 'TotRmsAbvGrd'),
  Text(9, 0, 'YearBuilt'),
  Text(10, 0, 'YearRemodAdd'),
  Text(11, 0, 'GarageYrBlt'),
  Text(12, 0, 'MasVnrArea'),
  Text(13, 0, 'Fireplaces'),
  Text(14, 0, 'BsmtFinSF1'),
  Text(15, 0, 'LotFrontage'),
  Text(16, 0, 'WoodDeckSF'),
  Text(17, 0, '2ndFlrSF'),
  Text(18, 0, 'OpenPorchSF'),
  Text(19, 0, 'HalfBath'),
  Text(20, 0, 'LotArea'),
  Text(21, 0, 'BsmtFullBath'),
  Text(22, 0, 'BsmtUnfSF'),
  Text(23, 0, 'BedroomAbvGr'),
  Text(24, 0, 'ScreenPorch'),
  Text(25, 0, 'PoolArea'),
  Text(26, 0, 'MoSold'),
  Text(27, 0, '3SsnPorch'),
  Text(28, 0, 'BsmtFinSF2'),
  Text(29, 0, 'BsmtHalfBath'),
  Text(30, 0, 'MiscVal'),
  Text(31, 0, 'Id'),
  Text(32, 0, 'LowQualFinSF'),
  Text(33, 0, 'YrSold'),
  Text(34, 0, 'OverallCond'),
  Text(35, 0, 'MSSubClass'),
  Text(36, 0, 'EnclosedPorch'),
  Text(37, 0, 'KitchenAbvGr')])
# The ten variables most positively correlated with SalePrice (SalePrice itself included)
num_corr = df_train.select_dtypes('number').corr()
top10 = num_corr.SalePrice.sort_values(ascending = False).index[0:10]
sns.heatmap(num_corr.loc[top10, top10],
            annot = True,
            linewidths=.3)
<AxesSubplot:>

Some observations about the resulting correlation heatmap:

- 'OverallQual': makes sense; the better the overall quality of the house, the higher the price. Note that the variable list below uses 'OverallCond' instead, which sits near the bottom of the correlation ranking above;
- 'GrLivArea': makes sense;
- 'GarageCars' and 'GarageArea': these look like twins, with a correlation of about 0.88, so we'll keep just one of them for the analysis;
- 'TotalBsmtSF': total square feet of basement area, which makes sense too;
- '1stFlrSF': first-floor square feet; sounds good, but it carries much the same information as 'TotalBsmtSF';
- 'FullBath': useful;
- 'TotRmsAbvGrd': a twin of 'GrLivArea';
- 'YearBuilt': good, but only slightly correlated with the price.
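The "twin" variables noted in the list above can also be found programmatically rather than by reading the heatmap. The sketch below is a hedged illustration on synthetic data: the helper `correlated_pairs`, the 0.8 threshold, and the toy columns are assumptions, not part of the original analysis.

```python
import numpy as np
import pandas as pd

def correlated_pairs(df, threshold=0.8):
    """Return column pairs whose absolute correlation exceeds the threshold."""
    corr = df.corr().abs()
    # Keep only the upper triangle so that each pair is listed once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    return pairs[pairs > threshold].sort_values(ascending=False)

# Synthetic stand-in for the housing data: garage area grows with garage capacity
rng = np.random.default_rng(0)
cars = rng.integers(0, 4, size=200)
toy = pd.DataFrame({
    "GarageCars": cars,
    "GarageArea": cars * 250 + rng.normal(0, 40, size=200),
    "YearBuilt": rng.integers(1900, 2010, size=200),
})

pairs = correlated_pairs(toy)
print(pairs)
```

On the real training set, the same helper could be applied to the numeric columns to flag candidates for removal before modelling.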
var_used = ['SalePrice', 'OverallCond', 'GrLivArea','GarageCars', 'TotalBsmtSF','FullBath','YearBuilt']

sns.pairplot(df_train[var_used], height = 7)
<seaborn.axisgrid.PairGrid at 0x1a378ed4f98>
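train = df_train[var_used].copy()
var_used_test = var_used.copy()
var_used_test.remove('SalePrice')
# Copy the slice so that later column assignments do not trigger SettingWithCopyWarning
test = df_test[var_used_test].copy()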
test
OverallCond GrLivArea GarageCars TotalBsmtSF FullBath YearBuilt
0 6 896 1.0 882.0 1 1961
1 6 1329 1.0 1329.0 1 1958
2 5 1629 2.0 928.0 2 1997
3 6 1604 2.0 926.0 2 1998
4 5 1280 2.0 1280.0 2 1992
... ... ... ... ... ... ...
1454 7 1092 0.0 546.0 1 1970
1455 5 1092 1.0 546.0 1 1970
1456 7 1224 2.0 1224.0 1 1960
1457 5 970 0.0 912.0 1 1992
1458 5 2000 3.0 996.0 2 1993

1459 rows × 6 columns

Exclude the outliers

drop_out = list(train.loc[train.GrLivArea > 4500].index)
train = train.drop(labels = drop_out, axis = 0)
sns.set()

ax = sns.scatterplot(x = 'GrLivArea', y = 'SalePrice', data = train)

Reducing the skewness of SalePrice

sns.distplot(train.SalePrice)
<AxesSubplot:xlabel='SalePrice'>
print('Skewness of the SalePrice %f' % stats.skew(train.SalePrice))
Skewness of the SalePrice 1.879360
train.SalePrice = np.log1p(train.SalePrice)
sns.distplot(train.SalePrice)
print('Skewness of the SalePrice %f' % stats.skew(train.SalePrice))
Skewness of the SalePrice 0.121455
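The effect of the log transform, and the way it is undone at prediction time, can be reproduced on synthetic data. This is a minimal sketch assuming log-normally distributed prices; the distribution parameters are made up for illustration and are not fitted to the Ames data.

```python
import numpy as np
import scipy.stats as stats

# Synthetic right-skewed "prices" (assumed log-normal, parameters are illustrative)
rng = np.random.default_rng(0)
prices = rng.lognormal(mean=12.0, sigma=0.4, size=5000)

log_prices = np.log1p(prices)      # same transform as applied to SalePrice
skew_raw = stats.skew(prices)      # clearly positive
skew_log = stats.skew(log_prices)  # close to zero

# expm1 is the exact inverse of log1p, so predictions can be mapped back to prices
restored = np.expm1(log_prices)

print('Skewness before: %f, after: %f' % (skew_raw, skew_log))
```

This is why predictions made on the log scale are passed through `np.expm1` before they are written to the submission file.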
skewness = train.apply(lambda x:stats.skew(x))
skewness
SalePrice      0.121455
OverallCond    0.690324
GrLivArea      1.009951
GarageCars    -0.342025
TotalBsmtSF    0.511177
FullBath       0.031239
YearBuilt     -0.611665
dtype: float64
sns.distplot(train.GrLivArea)
<AxesSubplot:xlabel='GrLivArea'>
# Apply the log transformation to GrLivArea

train.GrLivArea = np.log1p(train.GrLivArea)
test.GrLivArea = np.log1p(test.GrLivArea)
sns.distplot(train.GrLivArea)
<AxesSubplot:xlabel='GrLivArea'>

Preprocessing data

from sklearn.preprocessing import StandardScaler

X = train.drop('SalePrice', axis = 1)

y = train.SalePrice

# X = pd.get_dummies(X, columns = ['OverallCond', 'GarageCars', 'FullBath'])

X.reset_index(inplace = True, drop = True)

scaler = StandardScaler()

X
OverallCond GrLivArea GarageCars TotalBsmtSF FullBath YearBuilt
0 5 7.444833 2 856 2 2003
1 8 7.141245 2 1262 2 1976
2 5 7.488294 2 920 2 2001
3 5 7.448916 3 756 1 1915
4 5 7.695758 3 1145 2 2000
... ... ... ... ... ... ...
1453 5 7.407318 2 953 2 1999
1454 6 7.637234 2 1542 2 1978
1455 9 7.758333 1 1152 2 1941
1456 6 6.983790 1 1078 1 1950
1457 6 7.136483 1 1256 1 1965

1458 rows × 6 columns

def diff(li1, li2): 
    li_dif = [abs(i - j) for i, j in zip(li1, li2)]
    return tuple(li_dif)

def get_dummy(data, col, dim):
    dummy = pd.DataFrame()
    for i, column in enumerate(col):
        dummy = pd.get_dummies(data[column])
#         dummy = dummy.merge(pd.get_dummies(X[column]), how = 'inner', left_index= True, right_index=True)
        if pd.get_dummies(data[column]).shape[1] < dim[i]:
            lack = diff([0, dim[i]], list(pd.get_dummies(data[column]).shape))
            zeros = pd.DataFrame(np.zeros(lack))
            dummy = pd.concat([dummy, zeros], axis = 1, join = 'inner', ignore_index = True)       
        data = pd.concat([data, dummy], axis = 1)
        data = data.drop(column, axis = 1)
#         data = data.merge(dummy, how = 'inner', left_index= True, right_index=True)
    return data
X = get_dummy(X, ['OverallCond', 'GarageCars', 'FullBath'], [9, 6, 5])
X.shape
(1458, 23)
X.columns
Index([  'GrLivArea', 'TotalBsmtSF',   'YearBuilt',             1,
                   2,             3,             4,             5,
                   6,             7,             8,             9,
                   0,             1,             2,             3,
                   4,             5,             0,             1,
                   2,             3,             4],
      dtype='object')
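The padding logic in get_dummy exists because some levels (for example five garage spaces or four full baths) appear in the test set but not in the training set. One alternative, shown here only as a sketch on toy data, is to declare the full set of levels up front with `pd.Categorical` so that `pd.get_dummies` emits the same named columns for both frames. The helper `dummies_with_levels` and the level lists are assumptions for illustration.

```python
import pandas as pd

def dummies_with_levels(data, levels):
    """One-hot encode columns using a fixed list of categories per column."""
    out = data.copy()
    for column, categories in levels.items():
        # Unobserved categories still produce an (all-zero) dummy column
        out[column] = pd.Categorical(out[column], categories=categories)
    return pd.get_dummies(out, columns=list(levels))

# Assumed level lists; the real ones would be taken from train and test together
levels = {"GarageCars": [0, 1, 2, 3, 4, 5], "FullBath": [0, 1, 2, 3, 4]}

toy_train = pd.DataFrame({"GrLivArea": [7.4, 7.1, 7.5],
                          "GarageCars": [2, 3, 0],
                          "FullBath": [2, 1, 2]})
toy_test = pd.DataFrame({"GrLivArea": [6.8, 7.6],
                         "GarageCars": [1, 5],
                         "FullBath": [1, 4]})

a = dummies_with_levels(toy_train, levels)
b = dummies_with_levels(toy_test, levels)
print(list(a.columns))
```

Both frames end up with identical, descriptively named columns, which avoids the duplicated integer column labels seen in the output above.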
X = scaler.fit_transform(X)
X.shape
(1458, 23)

Models

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3)

Gradient Boosting Regressor

from sklearn.ensemble import GradientBoostingRegressor
reg = GradientBoostingRegressor(random_state=0).fit(X_train, y_train)

reg.score(X_test, y_test)
0.8442008349303682
from sklearn.metrics import r2_score

yhat = reg.predict(X_test)

print("Mean absolute error: %.4f" % np.mean(np.absolute(yhat - y_test)))
print("Mean squared error (MSE): %.4f" % np.mean((yhat - y_test) ** 2))
# r2_score expects (y_true, y_pred); in this order it equals reg.score(X_test, y_test) above
print("R2-score: %.4f" % r2_score(y_test, yhat))
Mean absolute error: 0.1128
Mean squared error (MSE): 0.0245
R2-score: 0.8442
sns.scatterplot(x = np.expm1(yhat), y = np.expm1(y_test))
<AxesSubplot:ylabel='SalePrice'>

Ridge

from sklearn.linear_model import Ridge

clf = Ridge(alpha = 8)
clf.fit(X_train, y_train)
clf.score(X_test, y_test)
0.8596645522643142
yhat = clf.predict(X_test)

print("Mean absolute error: %.4f" % np.mean(np.absolute(yhat - y_test)))
print("Mean squared error (MSE): %.4f" % np.mean((yhat - y_test) ** 2))
# r2_score expects (y_true, y_pred); in this order it equals clf.score(X_test, y_test) above
print("R2-score: %.4f" % r2_score(y_test, yhat))
Mean absolute error: 0.1086
Mean squared error (MSE): 0.0220
R2-score: 0.8597

Lasso

from sklearn.linear_model import Lasso

lasso = Lasso(alpha = 0.000002, max_iter = 10000).fit(X_train, y_train)  # max_iter must be an integer
lasso.score(X_test, y_test)
0.8593129526718498
yhat = lasso.predict(X_test)

print("Mean absolute error: %.4f" % np.mean(np.absolute(yhat - y_test)))
print("Mean squared error (MSE): %.4f" % np.mean((yhat - y_test) ** 2))
# r2_score expects (y_true, y_pred); in this order it equals lasso.score(X_test, y_test) above
print("R2-score: %.4f" % r2_score(y_test, yhat))
Mean absolute error: 0.1090
Mean squared error (MSE): 0.0221
R2-score: 0.8593

Missing values

print(train.isnull().sum())
print('_'*20)
print(test.isnull().sum())
SalePrice      0
OverallCond    0
GrLivArea      0
GarageCars     0
TotalBsmtSF    0
FullBath       0
YearBuilt      0
dtype: int64
____________________
OverallCond    0
GrLivArea      0
GarageCars     1
TotalBsmtSF    1
FullBath       0
YearBuilt      0
dtype: int64

Imputation for missing values

sns.distplot(test.GarageCars)
<AxesSubplot:xlabel='GarageCars'>
sns.distplot(test.TotalBsmtSF)
<AxesSubplot:xlabel='TotalBsmtSF'>
test.fillna(value = 0, inplace = True)

print(test.isnull().sum())
OverallCond    0
GrLivArea      0
GarageCars     0
TotalBsmtSF    0
FullBath       0
YearBuilt      0
dtype: int64
train.groupby(by = 'GarageCars').count()
SalePrice OverallCond GrLivArea TotalBsmtSF FullBath YearBuilt
GarageCars
0 81 81 81 81 81 81
1 369 369 369 369 369 369
2 823 823 823 823 823 823
3 180 180 180 180 180 180
4 5 5 5 5 5 5
test.groupby(by = 'GarageCars').count()
OverallCond GrLivArea TotalBsmtSF FullBath YearBuilt
GarageCars
0.0 77 77 77 77 77
1.0 407 407 407 407 407
2.0 770 770 770 770 770
3.0 193 193 193 193 193
4.0 11 11 11 11 11
5.0 1 1 1 1 1
train.groupby(by = 'FullBath').count()
SalePrice OverallCond GrLivArea GarageCars TotalBsmtSF YearBuilt
FullBath
0 9 9 9 9 9 9
1 650 650 650 650 650 650
2 767 767 767 767 767 767
3 32 32 32 32 32 32
test.groupby(by = 'FullBath').count()
OverallCond GrLivArea GarageCars TotalBsmtSF YearBuilt
FullBath
0 3 3 3 3 3
1 659 659 659 659 659
2 762 762 762 762 762
3 31 31 31 31 31
4 4 4 4 4 4
# x_test = get_dummy(test, ['OverallCond', 'GarageCars', 'FullBath'], [9, 6, 5])
x_test_2 = pd.get_dummies(test, columns = ['OverallCond', 'GarageCars', 'FullBath'])
test
OverallCond GrLivArea GarageCars TotalBsmtSF FullBath YearBuilt
0 6 6.799056 1.0 882.0 1 1961
1 6 7.192934 1.0 1329.0 1 1958
2 5 7.396335 2.0 928.0 2 1997
3 6 7.380879 2.0 926.0 2 1998
4 5 7.155396 2.0 1280.0 2 1992
... ... ... ... ... ... ...
1454 7 6.996681 0.0 546.0 1 1970
1455 5 6.996681 1.0 546.0 1 1970
1456 7 7.110696 2.0 1224.0 1 1960
1457 5 6.878326 0.0 912.0 1 1992
1458 5 7.601402 3.0 996.0 2 1993

1459 rows × 6 columns

x_test_2
GrLivArea TotalBsmtSF YearBuilt OverallCond_1 OverallCond_2 OverallCond_3 OverallCond_4 OverallCond_5 OverallCond_6 OverallCond_7 ... GarageCars_1.0 GarageCars_2.0 GarageCars_3.0 GarageCars_4.0 GarageCars_5.0 FullBath_0 FullBath_1 FullBath_2 FullBath_3 FullBath_4
0 6.799056 882.0 1961 0 0 0 0 0 1 0 ... 1 0 0 0 0 0 1 0 0 0
1 7.192934 1329.0 1958 0 0 0 0 0 1 0 ... 1 0 0 0 0 0 1 0 0 0
2 7.396335 928.0 1997 0 0 0 0 1 0 0 ... 0 1 0 0 0 0 0 1 0 0
3 7.380879 926.0 1998 0 0 0 0 0 1 0 ... 0 1 0 0 0 0 0 1 0 0
4 7.155396 1280.0 1992 0 0 0 0 1 0 0 ... 0 1 0 0 0 0 0 1 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1454 6.996681 546.0 1970 0 0 0 0 0 0 1 ... 0 0 0 0 0 0 1 0 0 0
1455 6.996681 546.0 1970 0 0 0 0 1 0 0 ... 1 0 0 0 0 0 1 0 0 0
1456 7.110696 1224.0 1960 0 0 0 0 0 0 1 ... 0 1 0 0 0 0 1 0 0 0
1457 6.878326 912.0 1992 0 0 0 0 1 0 0 ... 0 0 0 0 0 0 1 0 0 0
1458 7.601402 996.0 1993 0 0 0 0 1 0 0 ... 0 0 1 0 0 0 0 1 0 0

1459 rows × 23 columns

x_test_2 = scaler.fit_transform(x_test_2)
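Note that calling fit_transform on the test features re-estimates the mean and standard deviation from the test set, so train and test end up on slightly different scales. A common alternative is to fit the scaler on the training features only and reuse it with transform. The sketch below uses small made-up arrays and a separate `toy_scaler`; it does not reproduce the results shown in this notebook.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up feature matrices standing in for X and x_test_2
train_features = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
test_features = np.array([[2.0, 400.0]])

toy_scaler = StandardScaler()
train_scaled = toy_scaler.fit_transform(train_features)  # learn mean/std from train
test_scaled = toy_scaler.transform(test_features)        # reuse the train statistics

print(test_scaled)
```

With this pattern a test value equal to the training mean maps to 0, and the model sees test features on the same scale it was trained on.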

Final Model

from sklearn.linear_model import Lasso

lasso_2 = Lasso(alpha = 0.000002, max_iter = 10000).fit(X, y)  # max_iter must be an integer
lasso_2.score(X, y)
0.8526452970022899
yhat_final = lasso_2.predict(x_test_2)
sns.distplot(yhat_final)
<AxesSubplot:>
dicto = {'Id': list(df_test.Id), 'SalePrice': np.expm1(yhat_final).tolist()}
len(dicto['Id'])
1459
submission = pd.DataFrame(dicto)
submission
Id SalePrice
0 1461 113976.301594
1 1462 158335.287355
2 1463 191894.306849
3 1464 202444.764972
4 1465 174599.506222
... ... ...
1454 2915 121639.299342
1455 2916 117651.621741
1456 2917 172350.601297
1457 2918 117082.978333
1458 2919 250747.137098

1459 rows × 2 columns

submission.to_csv('submission.csv',index=False)