(This fictional report was prepared as a class assignment for MUSA507 at the University of Pennsylvania)
My client, Zillow, has realized that the predictions of home prices in the Boston area which are currently on the site are not sufficiently accurate. This lack of accuracy is putting the site’s credibility in question and hurting its ability to attract and retain customers in the region.
Accurately predicting home prices is difficult. Home prices are a function of many complex considerations including the physical characteristics, location, and other factors including seasonal variations and the personal preferences of buyers and sellers.
To achieve better prediction accuracy, I have collected data to capture as many of the factors listed above as possible. More specifically, I worked to capture the variation in prices due to both physical characteristics of homes, and the underlying spatial patterns of prices in the region.
The resulting model was able to improve predictions considerably, with a new error margin of only about 11%-12%.
The map below shows the observation points in the data set.
library(dplyr)
options(scipen=999)
df <- read.csv("WrangledData.csv")
df2 <- select(df, -Parcel_No, -Latitude_1, -Longitud_1)
df$SalePrice <- sapply(df$SalePrice, as.numeric)
df_Train <- df2[df2$SalePrice > 0,]
df_Test <- df2[df2$SalePrice == 0,]
library(leaflet)
leaflet(data = df) %>% setView(lng = -71.08, lat = 42.33, zoom = 12) %>% addProviderTiles(providers$CartoDB.Positron) %>%
addProviderTiles(providers$Stamen.TonerLines, options = providerTileOptions(opacity = 0.35)) %>%
addProviderTiles( providers$Stamen.TonerLabels) %>%
addCircleMarkers(lng = df$Longitud_1, lat = df$Latitude_1, radius = 3, color = ifelse(df$SalePrice > 0, "blue", "red"), stroke = FALSE, fillOpacity = 0.5, clusterOptions = markerClusterOptions(disableClusteringAtZoom = 13), label = ~as.character(paste("Sale Number: ", UniqueSale, ", " ,"Sale Price: $", SalePrice))) %>%
addLegend("topright", colors = c("red", "blue"), labels = c("Test-Set", "Training-Set"),
title = "Training/Testing Set",
opacity = 1
)
Data variables came from various sources. Physical housing characteristics, and the sales date were provided in the original Zillow data set. Some variables from the original data set with many categorical values or uneven distribution of values (e.g. sale date, year remodeled) were modified in order to increase predictive capacity.
New variables which model spatial considerations were added to the data set. These variables were created through spatial analyses of the original data set, as well as through the incorporation of data gathered from open data sources on the web.
Description of the variables is available in the Appendix.
library(stargazer)
stargazer(df_Train, type="text", title = "Variable Summary Statistics")
##
## Variable Summary Statistics
## ===================================================================
## Statistic N Mean St. Dev. Min Max
## -------------------------------------------------------------------
## UniqueSale 1,323 739.166 428.927 1 1,485
## SalePrice 1,323 642,335.000 565,859.000 200,000 11,600,000
## LivingArea 1,323 2,310.966 993.217 573 9,908
## LAND_SF 1,323 4,562.249 3,382.821 498 63,941
## YR_BUILT 1,323 1,916.409 30.242 1,802 2,013
## YR_REMOD 1,323 749.477 968.163 0 2,013
## Last_Work 1,323 1,951.161 45.022 1,830 2,013
## NewlyRemodeled 1,323 0.138 0.345 0 1
## GROSS_AREA 1,323 3,622.726 1,422.172 826 11,919
## LIVING_ARE 1,323 2,280.961 994.508 573 9,908
## NUM_FLOORS 1,323 2.187 0.646 1.000 4.000
## R_TOTAL_RM 1,323 9.760 3.766 3 21
## R_BDRMS 1,323 4.494 2.008 1 14
## R_FULL_BTH 1,323 1.995 0.894 1 8
## R_HALF_BTH 1,323 0.363 0.547 0 3
## R_KITCH 1,323 1.697 0.822 1 4
## R_FPLACE 1,323 0.407 0.758 0 5
## ZIPCODE 1,323 2,128.682 16.894 2,108 2,467
## PTYPE 1,323 103.912 34.071 101 985
## OWN_OCC 1,323 0.559 0.497 0 1
## HEAT_SYS 1,323 0.986 0.116 0 1
## C_AC 1,323 0.145 0.352 0 1
## FeetToParks 1,323 492.798 353.428 3 1,964
## WalkScore 1,323 85.565 5.534 70 98
## TransitSco 1,323 68.864 14.870 44 100
## BikeScore 1,323 65.729 12.547 53 95
## FeetToMetro 1,323 5,674.466 4,965.023 187 19,768
## MEDHHINC 1,323 73,351.130 30,775.610 9,756 166,375
## PCTBACHMOR 1,323 39.403 22.517 0 97
## PCTOWNEROC 1,323 44.602 22.840 0 96
## PCTVACANT 1,323 4.594 3.061 0 28
## PCTWHITE 1,323 59.865 31.201 2 100
## CrimeIndex 1,323 3.964 1.931 1.000 9.610
## NearUni 1,323 0.168 0.374 0 1
## SchoolGrade 1,323 3.125 1.711 1 9
## DistToCBD 1,323 25,517.900 12,816.110 2,231 52,892
## NearImpBldg 1,323 0.011 0.106 0 1
## NearCommonwealth 1,323 0.003 0.055 0 1
## DistToPoor 1,323 3,610.567 3,149.793 0 15,573
## NearAP 1,323 0.021 0.144 0 1
## Ave_SalePr 1,323 638,684.900 447,084.200 283,541 4,470,000
## Dist_AP 1,323 31,408.400 14,682.450 2,894.213 57,307.500
## Dist_SC 1,323 2,164.640 942.232 420.376 4,909.092
## Dist_Major_Road 1,323 2,720.628 2,543.041 3.011 15,749.200
## -------------------------------------------------------------------
Avoiding multi-collinearity (high level of correlation between predictors) can help improve the accuracy of the model. To examine this, The correlation matrix of the numeric variables in the model is shown below. Variables with correlation greater than 0.8 will be removed from the model.
dfNum <- select(df_Train, -UniqueSale, -SaleSeason, -STRUCTURE_, -R_BLDG_STY, -SaleMonth, -Style , -LU, -R_ROOF_TYP, -HEAT_SYS, -R_EXT_FIN, -Neighborhood)
#Correlation Matrix
CorMatrix <- cor(dfNum)
library(corrplot)
corrplot(CorMatrix, method = "color", order = "AOE", addCoef.col="grey", type = "upper", addCoefasPercent = FALSE, number.cex = .7)
#Removing multi-colinear variables (>.80)
df2 <- select(df2, -GROSS_AREA, -LivingArea, -YR_REMOD, -R_BDRMS, -R_KITCH, -Dist_AP)
df_Train <- df2[df2$SalePrice > 0,]
df_Test <- df2[df2$SalePrice == 0,]
The OLS model works best when the predictors and dependent variable are normally distributed. The distribution of some predictors may be brought closer to a normal distribution by log-transforming them. The plots below show the current distributions of the continuous predictors.
#Analysis of continuous predictors
dfCont <- select(dfNum, -GROSS_AREA, -LivingArea, -YR_REMOD, -NewlyRemodeled, -R_BDRMS, -R_KITCH, -Dist_AP, -NearCommonwealth, -NearImpBldg, -NearUni, -C_AC, -OWN_OCC, -PTYPE, -ZIPCODE,
-R_FPLACE, -R_HALF_BTH, -R_FULL_BTH, -NearAP)
#Distribution analysis
library(reshape2)
dfContMelted <- melt(dfCont)
library(ggplot2)
ggplot(data = dfContMelted, mapping = aes(x = value)) +
geom_histogram(bins = 30) + facet_wrap(~variable, scales = 'free_x') + theme(axis.text.x = element_blank())
Predictors which are negatively skewed are the best candidates for normalizing by using log-transformation. The new distributions of the transformed predictors are shown below.
#Log-Transforming to normalize selected predictors
ggplot(df2, aes(x=LAND_SF)) + geom_histogram()
ggplot(df2, aes(x=log(LAND_SF))) + geom_histogram()
#Log-Transforming to normalize selected predictors
ggplot(df2, aes(x=LIVING_ARE)) + geom_histogram()
ggplot(df2, aes(x=log(LIVING_ARE))) + geom_histogram()
#Log-Transforming to normalize selected predictors
ggplot(df2, aes(x=PCTVACANT)) + geom_histogram() + stat_bin(bins = 10)
ggplot(df2, aes(x=log(PCTVACANT + 1))) + geom_histogram() + stat_bin(bins = 10)
#Log-Transforming to normalize selected predictors
ggplot(df2, aes(x=Dist_Major_Road)) + geom_histogram() + stat_bin(bins = 20)
ggplot(df2, aes(x=log(Dist_Major_Road + 1))) + geom_histogram() + stat_bin(bins = 20)
#Log-Transforming to normalize selected predictors
ggplot(df2, aes(x=Ave_SalePr)) + geom_histogram() + stat_bin(bins = 30)
ggplot(df2, aes(x=log(Ave_SalePr))) + geom_histogram() + stat_bin(bins = 30)
To generate the price predictions, I will use a Hedonic OLS Regression model. This method evaluates the direction and strength of the relationship between the dependent variable in question (home prices) and the many factors (predictors) which may affect it. The model can estimate the effect of each of our predictors on sale prices while holding all other predictors constant, thereby allowing us to consider the effect of different variables concurrently.
To train the model, I created a training data set (data set with known sale prices) which included 1323 observations to “train” the regression model to predict the home sales price in the test data set (the data set of homes with unknown, or 0, prices). Using this training data, it is possible to calibrate the regression coefficients to model home prices based on the data. Lastly, I used this “trained” model to predict the sale prices in the test set.
In evaluating the predictive ability of our model, I took two separate approaches. The first approach was “In-sample training”, in which I divided the training data-set into two groups, and used one of the groups to predict the prices in the other. The second approach was a 10-fold cross-validation algorithm, which randomly divids the training set into ten equal “folds”, and one by one predicts prices for each of the folds using the remaining nine folds.
To examine whether the model was sufficiently capturing spatial structure of prices, I used the Moran’s I method. This method evaluates whether the model’s errors are clustered in space to a statistically-significant degree (which would indicate some spatial dynamic that was not accounted for in the model).
The model-building process is shown below.
reg1 <- lm(log(SalePrice) ~ ., data = df_Train %>%
as.data.frame %>% dplyr::select(-UniqueSale))
step$anova
## Stepwise Model Path
## Analysis of Deviance Table
##
## Initial Model:
## log(SalePrice) ~ YR_BUILT + Last_Work + NewlyRemodeled + NUM_FLOORS +
## R_TOTAL_RM + R_FULL_BTH + R_HALF_BTH + R_FPLACE + SaleSeason +
## SaleMonth + Style + ZIPCODE + PTYPE + OWN_OCC + LU + STRUCTURE_ +
## R_BLDG_STY + R_ROOF_TYP + R_EXT_FIN + HEAT_SYS + C_AC + FeetToParks +
## Neighborhood + WalkScore + TransitSco + BikeScore + FeetToMetro +
## MEDHHINC + PCTBACHMOR + PCTOWNEROC + PCTWHITE + CrimeIndex +
## NearUni + SchoolGrade + DistToCBD + NearImpBldg + NearCommonwealth +
## DistToPoor + NearAP + Dist_SC + LogLAND_SF + LogLIVING_ARE +
## LogPCTVACANT + LogDist_Major_Road + LogAve_SalePr
##
## Final Model:
## log(SalePrice) ~ Last_Work + NewlyRemodeled + NUM_FLOORS + R_FULL_BTH +
## R_HALF_BTH + R_FPLACE + SaleMonth + Style + PTYPE + LU +
## R_BLDG_STY + R_EXT_FIN + C_AC + Neighborhood + FeetToMetro +
## PCTBACHMOR + PCTWHITE + CrimeIndex + NearUni + NearImpBldg +
## NearAP + Dist_SC + LogLAND_SF + LogLIVING_ARE + LogAve_SalePr
##
##
## Step Df Deviance Resid. Df Resid. Dev
## 1 1216 28.71204
## 2 - BikeScore 0 0.0000000000000000000 1216 28.71204
## 3 - TransitSco 0 0.0000000000000000000 1216 28.71204
## 4 - WalkScore 0 0.0000000000000000000 1216 28.71204
## 5 - SaleSeason 0 0.0000000000003552714 1216 28.71204
## 6 - R_ROOF_TYP 5 0.0946874920781439755 1221 28.80673
## 7 - STRUCTURE_ 2 0.0358108356566724240 1223 28.84254
## 8 - FeetToParks 1 0.0000047751824361342 1224 28.84254
## 9 - R_TOTAL_RM 1 0.0000139125347544677 1225 28.84256
## 10 - LogDist_Major_Road 1 0.0001061660543335563 1226 28.84266
## 11 - PCTOWNEROC 1 0.0002904673328529839 1227 28.84296
## 12 - MEDHHINC 1 0.0008064345835343545 1228 28.84376
## 13 - OWN_OCC 1 0.0030725642581721502 1229 28.84683
## 14 - HEAT_SYS 1 0.0059915654671662821 1230 28.85283
## 15 - DistToCBD 1 0.0138947193882579256 1231 28.86672
## 16 - YR_BUILT 1 0.0138758735296633517 1232 28.88060
## 17 - LogPCTVACANT 1 0.0151253646420634880 1233 28.89572
## 18 - SchoolGrade 1 0.0202790166078798961 1234 28.91600
## 19 - DistToPoor 1 0.0214955404639987080 1235 28.93750
## 20 - ZIPCODE 1 0.0265304549993281569 1236 28.96403
## 21 - NearCommonwealth 1 0.0328592482080054538 1237 28.99689
## AIC
## 1 -4853.541
## 2 -4853.541
## 3 -4853.541
## 4 -4853.541
## 5 -4853.541
## 6 -4859.185
## 7 -4861.541
## 8 -4863.541
## 9 -4865.540
## 10 -4867.535
## 11 -4869.522
## 12 -4871.485
## 13 -4873.344
## 14 -4875.069
## 15 -4876.432
## 16 -4877.797
## 17 -4879.104
## 18 -4880.176
## 19 -4881.193
## 20 -4881.980
## 21 -4882.480
reg2 <- lm(log(SalePrice) ~ ., data = df_Train %>%
as.data.frame %>% dplyr::select(-UniqueSale, -PCTOWNEROC, -DistToPoor,
-DistToCBD, -SchoolGrade, -MEDHHINC, -WalkScore, -TransitSco,
-BikeScore, -FeetToParks, -HEAT_SYS, -R_EXT_FIN, -R_BLDG_STY, -R_ROOF_TYP, -STRUCTURE_,
-OWN_OCC, -ZIPCODE, -Style, -SaleMonth))
library(stargazer)
reg3 <- lm(log(SalePrice) ~ ., data = df_Train %>%
as.data.frame %>% dplyr::select(-UniqueSale, -LogDist_Major_Road, -PCTOWNEROC, -DistToPoor,
-DistToCBD, -SchoolGrade, -MEDHHINC, -WalkScore, -TransitSco,
-BikeScore, -FeetToParks, -HEAT_SYS, -R_EXT_FIN, -R_BLDG_STY, -R_ROOF_TYP, -STRUCTURE_,
-OWN_OCC, -ZIPCODE, -Style, -SaleMonth, -YR_BUILT, -R_TOTAL_RM, -NearImpBldg, -NearCommonwealth,
-NearCommonwealth, -LogPCTVACANT))
stargazer(reg1, reg2, reg3, type="text", title = "Model Outputs", align=TRUE, no.space=TRUE, single.row=TRUE, ci=FALSE, column.labels=c("Model 1","Model 2", "Model 3"))
##
## Model Outputs
## ==========================================================================================================
## Dependent variable:
## --------------------------------------------------------------------------------
## log(SalePrice)
## Model 1 Model 2 Model 3
## (1) (2) (3)
## ----------------------------------------------------------------------------------------------------------
## YR_BUILT 0.0001 (0.0002) 0.00003 (0.0002)
## Last_Work 0.0003** (0.0001) 0.0004*** (0.0001) 0.0003*** (0.0001)
## NewlyRemodeled 0.045*** (0.016) 0.038** (0.016) 0.040*** (0.016)
## NUM_FLOORS 0.036* (0.019) 0.064*** (0.014) 0.064*** (0.014)
## R_TOTAL_RM -0.0004 (0.003) -0.001 (0.003)
## R_FULL_BTH 0.044*** (0.010) 0.046*** (0.010) 0.047*** (0.010)
## R_HALF_BTH 0.024** (0.010) 0.018* (0.010) 0.020** (0.010)
## R_FPLACE 0.038*** (0.008) 0.044*** (0.008) 0.045*** (0.007)
## SaleSeasonSpring -0.063*** (0.023) -0.050*** (0.014) -0.051*** (0.014)
## SaleSeasonSummer 0.021 (0.019) 0.015 (0.012) 0.015 (0.011)
## SaleSeasonWinter -0.059*** (0.023) -0.047*** (0.013) -0.047*** (0.013)
## SaleMonthAug -0.007 (0.018)
## SaleMonthDec 0.060*** (0.023)
## SaleMonthFeb -0.051* (0.028)
## SaleMonthJan
## SaleMonthJul -0.006 (0.018)
## SaleMonthJun
## SaleMonthMar -0.020 (0.027)
## SaleMonthMay 0.070*** (0.024)
## SaleMonthNov -0.012 (0.022)
## SaleMonthOct 0.017 (0.021)
## SaleMonthSep
## StyleCape 0.237 (0.158)
## StyleColonial 0.681*** (0.196)
## StyleConventional 0.488** (0.196)
## StyleDecker 0.032 (0.047)
## StyleDuplex 0.884*** (0.285)
## StyleRaised Ranch -0.078* (0.047)
## StyleRanch 0.030 (0.115)
## StyleRow End 0.522** (0.236)
## StyleRow Middle 0.502** (0.241)
## StyleSemi Det 0.518** (0.238)
## StyleSplit Level -0.001 (0.070)
## StyleTri Level -0.157 (0.162)
## StyleTudor 0.628** (0.253)
## StyleTwo Fam Stack 0.441** (0.212)
## StyleVictorian 0.896*** (0.208)
## ZIPCODE -0.0003 (0.0003)
## PTYPE -0.042*** (0.016) -0.042*** (0.016) -0.042*** (0.016)
## OWN_OCC 0.004 (0.010)
## LUR1 -37.208*** (14.019) -36.867** (14.308) -36.898*** (14.273)
## LUR2 -37.017*** (13.971) -36.738** (14.260) -36.769*** (14.224)
## LUR3 -36.927*** (13.956) -36.641** (14.244) -36.673*** (14.209)
## STRUCTURE_C 0.134 (0.172)
## STRUCTURE_R -0.042 (0.066)
## R_BLDG_STYCL -0.650*** (0.195)
## R_BLDG_STYCP -0.224 (0.156)
## R_BLDG_STYCV -0.516*** (0.195)
## R_BLDG_STYDK
## R_BLDG_STYDX -0.964*** (0.289)
## R_BLDG_STYRE -0.513** (0.236)
## R_BLDG_STYRM -0.451* (0.240)
## R_BLDG_STYRN -0.010 (0.115)
## R_BLDG_STYRR
## R_BLDG_STYSD -0.514** (0.236)
## R_BLDG_STYSL
## R_BLDG_STYTF -0.474** (0.210)
## R_BLDG_STYTL
## R_BLDG_STYVT -0.765*** (0.211)
## R_ROOF_TYPG 0.010 (0.019)
## R_ROOF_TYPH 0.037* (0.022)
## R_ROOF_TYPL 0.022 (0.033)
## R_ROOF_TYPM 0.003 (0.024)
## R_ROOF_TYPS 0.023 (0.095)
## R_EXT_FINB 0.092*** (0.027)
## R_EXT_FINC 0.063 (0.065)
## R_EXT_FINF 0.042* (0.024)
## R_EXT_FINM 0.0001 (0.020)
## R_EXT_FINP -0.061* (0.036)
## R_EXT_FINS -0.054 (0.057)
## R_EXT_FINU 0.139 (0.112)
## R_EXT_FINV 0.158 (0.114)
## R_EXT_FINW 0.014 (0.021)
## HEAT_SYS -0.017 (0.039)
## C_AC 0.057*** (0.016) 0.059*** (0.016) 0.060*** (0.015)
## FeetToParks -0.00000 (0.00001)
## NeighborhoodBack Bay 0.998*** (0.151) 1.025*** (0.149) 1.072*** (0.140)
## NeighborhoodBay Village 0.532*** (0.180) 0.590*** (0.181) 0.590*** (0.180)
## NeighborhoodBeacon Hill 0.670*** (0.102) 0.701*** (0.097) 0.670*** (0.090)
## NeighborhoodBrighton -0.178* (0.093) -0.166* (0.094) -0.164* (0.093)
## NeighborhoodCharlestown 0.046 (0.082) 0.077 (0.075) 0.077 (0.072)
## NeighborhoodDorchester -0.251*** (0.075) -0.199*** (0.073) -0.195*** (0.072)
## NeighborhoodDowntown 0.244 (0.184) 0.287 (0.183) 0.291 (0.182)
## NeighborhoodEast Boston -0.330*** (0.081) -0.284*** (0.077) -0.281*** (0.076)
## NeighborhoodFenway 0.461*** (0.153) 0.456*** (0.151) 0.361*** (0.136)
## NeighborhoodHyde Park -0.245*** (0.081) -0.198** (0.077) -0.197*** (0.076)
## NeighborhoodJamaica Plain -0.072 (0.074) -0.039 (0.072) -0.036 (0.071)
## NeighborhoodMattapan -0.273*** (0.080) -0.221*** (0.078) -0.216*** (0.077)
## NeighborhoodMission Hill 0.064 (0.083) 0.110 (0.081) 0.113 (0.080)
## NeighborhoodRoslindale -0.210*** (0.076) -0.181** (0.075) -0.178** (0.074)
## NeighborhoodRoxbury -0.246*** (0.078) -0.186** (0.076) -0.180** (0.074)
## NeighborhoodSouth Boston 0.001 (0.082) 0.004 (0.077) 0.005 (0.074)
## NeighborhoodSouth End 0.638*** (0.096) 0.713*** (0.087) 0.690*** (0.084)
## NeighborhoodWest Roxbury -0.167** (0.079) -0.134* (0.078) -0.132* (0.077)
## WalkScore
## TransitSco
## BikeScore
## FeetToMetro -0.00001*** (0.00000) -0.00001*** (0.00000) -0.00001*** (0.00000)
## MEDHHINC 0.00000 (0.00000)
## PCTBACHMOR 0.002*** (0.0005) 0.002*** (0.0005) 0.002*** (0.0005)
## PCTOWNEROC 0.00001 (0.0003)
## PCTWHITE 0.003*** (0.0004) 0.003*** (0.0004) 0.003*** (0.0004)
## CrimeIndex -0.017*** (0.005) -0.018*** (0.005) -0.019*** (0.005)
## NearUni 0.042** (0.018) 0.037** (0.018) 0.038** (0.018)
## SchoolGrade 0.003 (0.003)
## DistToCBD -0.00000 (0.00000)
## NearImpBldg -0.141** (0.065) -0.094 (0.064)
## NearCommonwealth 0.122 (0.099) 0.093 (0.101)
## DistToPoor 0.00000 (0.00000)
## NearAP 0.106*** (0.036) 0.099*** (0.036) 0.100*** (0.035)
## Dist_SC -0.00002** (0.00001) -0.00002** (0.00001) -0.00002** (0.00001)
## LogLAND_SF 0.101*** (0.013) 0.099*** (0.013) 0.098*** (0.013)
## LogLIVING_ARE 0.313*** (0.026) 0.312*** (0.026) 0.310*** (0.022)
## LogPCTVACANT 0.008 (0.011) 0.002 (0.011)
## LogDist_Major_Road 0.001 (0.006) -0.002 (0.006)
## LogAve_SalePr 0.238*** (0.032) 0.246*** (0.032) 0.245*** (0.031)
## Constant 48.032*** (15.664) 46.924*** (15.971) 47.056*** (15.920)
## ----------------------------------------------------------------------------------------------------------
## Observations 1,323 1,323 1,323
## R2 0.897 0.884 0.883
## Adjusted R2 0.888 0.879 0.880
## Residual Std. Error 0.154 (df = 1216) 0.160 (df = 1274) 0.159 (df = 1280)
## F Statistic 99.929*** (df = 106; 1216) 201.689*** (df = 48; 1274) 231.061*** (df = 42; 1280)
## ==========================================================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
step$anova
## Stepwise Model Path
## Analysis of Deviance Table
##
## Initial Model:
## log(SalePrice) ~ Last_Work + NewlyRemodeled + NUM_FLOORS + R_FULL_BTH +
## R_HALF_BTH + R_FPLACE + SaleSeason + PTYPE + LU + C_AC +
## Neighborhood + FeetToMetro + PCTBACHMOR + PCTWHITE + CrimeIndex +
## NearUni + NearAP + Dist_SC + LogLAND_SF + LogLIVING_ARE +
## LogAve_SalePr
##
## Final Model:
## log(SalePrice) ~ Last_Work + NewlyRemodeled + NUM_FLOORS + R_FULL_BTH +
## R_HALF_BTH + R_FPLACE + SaleSeason + PTYPE + LU + C_AC +
## Neighborhood + FeetToMetro + PCTBACHMOR + PCTWHITE + CrimeIndex +
## NearUni + NearAP + Dist_SC + LogLAND_SF + LogLIVING_ARE +
## LogAve_SalePr
##
##
## Step Df Deviance Resid. Df Resid. Dev AIC
## 1 1280 32.48998 -4817.998
The following section will test the assumptions associated with the OLS model (ie. Residual normality, heteroscedasticity). Additionally, I will evaluate whether the model’s residuals exhibit significant spatial autocorrelation (indicating unaccounted-for spatial patterns).
Reg_Dataframe <- cbind(reg3$residuals,reg3$fitted.values)
Reg_Dataframe <- as.data.frame(Reg_Dataframe)
colnames(Reg_Dataframe) <- c("residuals", "predictedValues")
ggplot(reg3, aes(Reg_Dataframe$residuals)) + geom_histogram(bins=25) +
labs(x="Residuals",
y="Count")
Residuals look good!
Predicted as function of residuals:
ggplot(data = Reg_Dataframe, aes(x = residuals , y = predictedValues)) +
geom_point(size = 1) + xlab("Residuals") + ylab("Predicted Values") + ggtitle("Residual Values vs. Predicted Values") +
theme(plot.title = element_text(hjust = 0.5))
Predicted as function of observed:
regDF <- cbind(log(df_Train$SalePrice), reg3$fitted.values)
colnames(regDF) <- c("Observed", "Predicted")
regDF <- as.data.frame(regDF)
ggplot() +
geom_point(data=regDF, aes(Observed, Predicted)) +
stat_smooth(data=regDF, aes(Observed, Observed), method = "lm", se = FALSE, size = 1) +
labs(title="Predicted Price as a function\nof Observed Price") +
theme(plot.title = element_text(hjust = 0.5))
Looks sufficiently homoscedastic!
reg_residuals <- data.frame(reg3$residuals)
LonLat <- data.frame(df[df$SalePrice>0,]$Longitud_1, df[df$SalePrice>0,]$Latitude_1)
residualsToMap <- cbind(LonLat, reg_residuals )
colnames(residualsToMap) <- c("longitude", "latitude", "residual")
library(ggmap)
ggmap(baseMap_invert) +
geom_point(data = residualsToMap,
aes(x=longitude, y=latitude, color = residual),
size = 2) + scale_colour_gradient(low = "blue", high = "yellow") + mapTheme() +
labs(title="Prediction Residuals (Per Observation)")
library(ggmap)
Raster <-
ggmap(baseMap_invert) +
stat_summary_2d(geom = "tile",
bins = 80,
data=residualsToMap,
aes(x = longitude, y = latitude, z = ntile(residual,5))) +
scale_fill_gradient(low = "yellow", high = "blue",
guide = guide_legend(title = "Residuals \n (Quintiles)")) +
labs(title="Prediction Residuals (Raster Grid)") + mapTheme()
Raster
No immediately noticeable spatial pattern in residuals.
library(spdep)
coords <- cbind(df[df$SalePrice>0,]$Longitud_1, df[df$SalePrice>0,]$Latitude_1)
spatialWeights <- knn2nb(knearneigh(coords, 4))
moran.test(reg1$residuals, nb2listw(spatialWeights, style="W"))
##
## Moran I test under randomisation
##
## data: reg1$residuals
## weights: nb2listw(spatialWeights, style = "W")
##
## Moran I statistic standard deviate = 0.16468, p-value = 0.4346
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic Expectation Variance
## 0.0022273746 -0.0007564297 0.0003282955
Test results indicate that no significant spatial autocorrelation is present in the residuals.
library(caret)
library(stargazer)
inTrain <- createDataPartition(
y = df_Train$Neighborhood,
p = .75, list = FALSE)
IST.training <- df_Train[ inTrain,] #the in-sample training set
IST.test <- df_Train[-inTrain,] #the in-sample test set
reg4 <- lm(log(SalePrice) ~ ., data = IST.training%>% #regression with in-sample training data
as.data.frame %>% dplyr::select(-UniqueSale, -LogDist_Major_Road, -PCTOWNEROC, -DistToPoor,
-DistToCBD, -SchoolGrade, -MEDHHINC, -WalkScore, -TransitSco,
-BikeScore, -FeetToParks, -HEAT_SYS, -R_EXT_FIN, -R_BLDG_STY, -R_ROOF_TYP, -STRUCTURE_,
-OWN_OCC, -ZIPCODE, -Style, -SaleMonth, -YR_BUILT, -R_TOTAL_RM, -NearImpBldg, -NearCommonwealth,
-NearCommonwealth, -LogPCTVACANT, -LogDist_Major_Road))
#predict on in-sample test set
reg4Pred <- predict(reg4, IST.test)
reg4PredValues <-
data.frame(observedPrice = IST.test$SalePrice,
predictedPrice = exp(reg4Pred))
#store predictions, observed, absolute error, and percent absolute error
reg4PredValues <-
reg4PredValues %>%
mutate(error = predictedPrice - observedPrice) %>%
mutate(absError = abs(predictedPrice - observedPrice)) %>%
mutate(percentAbsError = abs(predictedPrice - observedPrice) / observedPrice)
stargazer(reg4PredValues, type = 'text')
##
## ======================================================================
## Statistic N Mean St. Dev. Min Max
## ----------------------------------------------------------------------
## observedPrice 323 619,531.800 447,474.900 245,000 3,900,000
## predictedPrice 323 608,338.200 421,368.000 260,397.700 3,890,731.000
## error 323 -11,193.530 139,614.100 -968,105.200 579,433.700
## absError 323 84,403.390 111,676.900 332.755 968,105.200
## percentAbsError 323 0.131 0.111 0.001 0.700
## ----------------------------------------------------------------------
fitControl <- trainControl(method = "cv", number = 10)
set.seed(825) #set seed for random number generator
lmFit <- train(log(SalePrice) ~ ., data = df_Train,
method = "lm",
trControl = fitControl)
lmFit$resample
## RMSE Rsquared MAE Resample
## 1 0.1687018 0.8878129 0.1122363 Fold01
## 2 0.2279301 0.8348478 0.1483443 Fold02
## 3 0.1721422 0.8708535 0.1240672 Fold03
## 4 0.1590075 0.8361266 0.1220684 Fold04
## 5 0.1833931 0.8188797 0.1349571 Fold05
## 6 0.1714996 0.8142559 0.1328334 Fold06
## 7 0.1745891 0.8882076 0.1360352 Fold07
## 8 0.1678816 0.8672234 0.1254778 Fold08
## 9 0.1634351 0.8559889 0.1229898 Fold09
## 10 0.1546993 0.8748805 0.1234529 Fold10
library(stargazer)
stargazer(lmFit$resample, type = "text")
##
## =======================================
## Statistic N Mean St. Dev. Min Max
## ---------------------------------------
## RMSE 10 0.174 0.020 0.155 0.228
## Rsquared 10 0.855 0.027 0.814 0.888
## MAE 10 0.128 0.010 0.112 0.148
## ---------------------------------------
The plot below shows the distibution of mean absolute error (MAE) values for the 10 folds.
#Evaluating Generalizeability: Fold MAE Frequency Plot
ggplot(as.data.frame(lmFit$resample), aes(MAE)) +
geom_histogram(bins=5) +
labs(x="Mean Absolute Error",
y="Count")
No significant outliers are present
The plot below shows the per-fold R-Square statistic.
CVFolds <- cbind(lmFit$resample$Resample, lmFit$resample$Rsquared)
colnames(CVFolds) <- c("Fold", "RSquared")
CVFolds <- as.data.frame(CVFolds)
CVFolds$RSquared <- as.numeric(as.character(CVFolds$RSquared))
CVFolds$RSquared <- formatC(CVFolds$RSquared,digits=2,format="f")
CVFolds$RSquared <- as.numeric(as.character(CVFolds$RSquared))
ggplot(CVFolds, aes(x=Fold, y=RSquared)) +
geom_bar(stat="identity", fill = "#4682b4") + scale_y_continuous(limits = c(0, 1)) + geom_text(aes(label=RSquared), vjust=-1) +
theme(axis.text.x=element_text(angle=45, hjust=1))
Since all folds have a similar R-Squared statistic, the model seems sufficiently generalizable.
The following code uses the chosen model to predict the sale prices of observations in the test set.
FinalReg <- lm(log(SalePrice) ~ ., data = df_Train%>%
as.data.frame %>% dplyr::select(-UniqueSale, -LogDist_Major_Road, -PCTOWNEROC, -DistToPoor,
-DistToCBD, -SchoolGrade, -MEDHHINC, -WalkScore, -TransitSco,
-BikeScore, -FeetToParks, -HEAT_SYS, -R_EXT_FIN, -R_BLDG_STY, -R_ROOF_TYP, -STRUCTURE_,
-OWN_OCC, -ZIPCODE, -Style, -SaleMonth, -YR_BUILT, -R_TOTAL_RM, -NearImpBldg, -NearCommonwealth,
-NearCommonwealth, -LogPCTVACANT, -LogDist_Major_Road))
FinalPred <- predict(FinalReg, df_Test)
FinalPredValues <-
data.frame(UniqueSale = df_Test$UniqueSale,
PredictedPrice = exp(FinalPred))
head(FinalPredValues)
## UniqueSale PredictedPrice
## 3 3 625807.1
## 20 20 384205.3
## 44 44 562834.1
## 51 51 432378.7
## 76 76 471795.3
## 82 82 595516.4
library(ggmap)
LonLat_Test <- data.frame(df[df$SalePrice==0,]$Longitud_1, df[df$SalePrice==0,]$Latitude_1)
PredictionsToMap <- cbind(FinalPredValues, LonLat_Test)
colnames(PredictionsToMap) <- c("UniqueSale", "PredictedPrice", "longitude", "latitude")
library(mosaic)
library(RColorBrewer)
cols <- colorRampPalette(brewer.pal(5,"RdYlGn"))(5)
ggmap(baseMap_invert) +
geom_point(data = PredictionsToMap, aes(x=longitude, y=latitude, color= ntiles(PredictedPrice, n = 5)), size = 3) +
mapTheme() + theme(legend.position="bottom") +
scale_color_manual(name ="Predicted Prices (Quintiles)", values = c(cols))
NewlyRemodeled - Whether the unit was remodeled since 2005 (binary variable)
GROSS_AREA - Gross floor area of the unit
NUM_FLOORS - Number of floors
R_TOTAL_RM - Total number of rooms
R_BDRMS - Number of bedrooms
R_FULL_BTH - Number of full bathrooms
R_HALF_BTH - Number of half bathrooms
R_KITCH - Number of kitchens in the structure
R_FPLACE - Number of fireplaces in the structure
SaleSeason - Season in which the sale took place. Summer= Jun-Aug, Fall = Sep-Nov, Winter = Dec-Feb, Spring= Mar-May
Style - Architectural style
LU - City’s land use designation
R_ROOF_TYP - Roof structure type: F Flat L Gambrel S Shed G Gable M Mansard H Hip O Other
R_EXT_FIN - Exterior finish type: A Asbestos K Concrete U Aluminum B Brick/Stone M Vinyl V Brick/Stone Veneer C Cement Board O Other W Wood Shake F Frame/Clapboard P Asphalt G Glass S Stucco
C_AC - Presence of central air-conditioning (binary)
FeetToParks - Distance to nearest park (feet)
FeetToMetro - Distance to nearest transit stop (feet)
MEDHHINC - Median household income of census block group
PCTBACHMOR - Percent of population in block group with bachelor’s degree or more
PCTWHITE - Percent of population in block group which identify as white
CrimeIndex - Crime ranking based on density of violent crime occurrences in 2015 (1-6)
NearUni - within one kilometer of university (binary)
SchoolGrade - ranking of nearest public school (1-9) from greatschools.com
DistToCBD - Distance to Central Business District (feet)
NearImpBldg - Whether near important landmark/building (binary)
NearCommonwealth - whether near commonwealth avenue or Boston commons (binary)
DistToPoor - Distance to neighborhoods with median household income less than 25k
NearAP - whether within 1500 feet of Logan Airport
Ave_SalePr - Average sale price of 5 nearest homes
Dist_SC - Distance to public schools (feet)
Dist_Major_Road - Distance to road with speed limit over 35 mph (feet)
LivingArea - Net living area in unit in feet (logged in model)
LAND_SF - Size of lot in feet (logged in model)