Project Goal

For this project, I plan to complete a linear regression analysis to create a model in order to analyze how different housing characteristics affect the price of a housing unit. This will include:

With this goal, the price of the housing unit will be our response variable, and the charateristics of the housing units will be our explanatory variables.

While completing this analysis, many different R functions will be used, such as functions to visualize and investigate variables, functions to update and subset data, functions to fit models, and functions to create plots.

Data used in this project

I currently work in survey operations for several housing surveys, which has led me to be extremely interested in datasets of housing information. This led me to select housing data for use in this project.

The dataset used is from Kaggle, and the format of the data is that there is one row for each housing unit in the sample. Each row contains information related to housing characteristics of the unit, such as number of bedrooms, number of stories in the unit, and if there is air conditioning.

The complete column list is as follows:

Note: There are two additional variables (prefarea and furnishingstatus) that I removed from the dataset prior to the analysis since I was not interested in these variables.

Analysis

Now that we have the goal of the project and knowledge of the dataset we are using, we are able to start the analysis with reading in the data.

Load Relevant Pacakges and Read in the Data

We will first use library() to load and attach the two relevant packages that we will be using functions from. We load the readr package so that we can use read_csv() to read in the data, and we load the DataExplorer package to have access to some helpful EDA functions. If you do not have either of these packages installed, you will first need to run install.packages(“PackageOfInterest”).

library(readr)
library(DataExplorer)

Now that we have the relevant packages attached, we can read in our dataset that we will use for this analysis. I pointed my working directory of this session to the location of the dataset I would like to read in. I can then use read_csv to import the data in. Once the data is loaded, str() can be used for a quick glance to make sure there were no issues on import. I am also going to set the give.attr argument in str() to FALSE in order to suppress extra output that shows attributes as substructures since that is not what we are interested in viewing.

setwd("~/Desktop/Stat 484:485/Stat 485 Project")
Housing <- read_csv("Housing.csv")
## Rows: 545 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): mainroad, guestroom, basement, hotwaterheating, airconditioning, pr...
## dbl (6): price, area, bedrooms, bathrooms, stories, parking
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
str(Housing,give.attr=FALSE)
## spec_tbl_df [545 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ price           : num [1:545] 13300000 12250000 12250000 12215000 11410000 ...
##  $ area            : num [1:545] 7420 8960 9960 7500 7420 7500 8580 16200 8100 5750 ...
##  $ bedrooms        : num [1:545] 4 4 3 4 4 3 4 5 4 3 ...
##  $ bathrooms       : num [1:545] 2 4 2 2 1 3 3 3 1 2 ...
##  $ stories         : num [1:545] 3 4 2 2 2 1 4 2 2 4 ...
##  $ mainroad        : chr [1:545] "yes" "yes" "yes" "yes" ...
##  $ guestroom       : chr [1:545] "no" "no" "no" "no" ...
##  $ basement        : chr [1:545] "no" "no" "yes" "yes" ...
##  $ hotwaterheating : chr [1:545] "no" "no" "no" "no" ...
##  $ airconditioning : chr [1:545] "yes" "yes" "no" "yes" ...
##  $ parking         : num [1:545] 2 3 2 3 2 2 2 0 2 1 ...
##  $ prefarea        : chr [1:545] "yes" "no" "yes" "yes" ...
##  $ furnishingstatus: chr [1:545] "furnished" "furnished" "semi-furnished" "furnished" ...

Next, I need to remove the two variables mentioned in the data overview section above from the dataset since they are not of interest. We can use the indexing brackets and use a negative sign in front of the column numbers we would like to remove to remove the prefarea and furnishingstatus variables. We would like to remove all rows of this variable so we leave the first argument in the brackets blank, and pass in the column numbers we would like to remove to the second argument.

Housing<-Housing[,-c(12,13)]

In the output to str(Housing,give.attr=FALSE) above, we can see all of our yes/no variables are of character type. We will want to treat these as factors for purposes of the analysis, with two levels. We can use the as.factor() function to convert these columns to factor type. With the syntax below, we update the column of interest to factor form and then pass it back into the dataset.

Housing$mainroad<-as.factor(Housing$mainroad)
Housing$guestroom<-as.factor(Housing$guestroom)
Housing$basement<-as.factor(Housing$basement)
Housing$hotwaterheating<-as.factor(Housing$hotwaterheating)
Housing$airconditioning<-as.factor(Housing$airconditioning)
str(Housing)
## tibble [545 × 11] (S3: tbl_df/tbl/data.frame)
##  $ price          : num [1:545] 13300000 12250000 12250000 12215000 11410000 ...
##  $ area           : num [1:545] 7420 8960 9960 7500 7420 7500 8580 16200 8100 5750 ...
##  $ bedrooms       : num [1:545] 4 4 3 4 4 3 4 5 4 3 ...
##  $ bathrooms      : num [1:545] 2 4 2 2 1 3 3 3 1 2 ...
##  $ stories        : num [1:545] 3 4 2 2 2 1 4 2 2 4 ...
##  $ mainroad       : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ guestroom      : Factor w/ 2 levels "no","yes": 1 1 1 1 2 1 1 1 2 2 ...
##  $ basement       : Factor w/ 2 levels "no","yes": 1 1 2 2 2 2 1 1 2 1 ...
##  $ hotwaterheating: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ airconditioning: Factor w/ 2 levels "no","yes": 2 2 1 2 2 2 2 1 2 2 ...
##  $ parking        : num [1:545] 2 3 2 3 2 2 2 0 2 1 ...

Exploratory Data Analysis

Now that we have all our data loaded and updated, we can conduct some exploratory analysis into the data to determine if there are any missing values, look for any patterns, and start drawing insights.

We can first use the plot_intro() function from the DataExplorer package to gain some high level details about the dataset. From this plot below, we can see that we have a healthy mix of continuous and discrete columns, there are not any columns of only missing data, all rows are complete, and there are not any missing observations.

plot_intro(Housing)

Now that we know our data is complete, we can look at the variables themselves using the plot_histogram() and plot_bar() functions from the DataExplorer package. Both functions do exactly as their names imply. Plot_histogram() will display a histogram for all the continuous variables and plot_barplot() will display a barplot for all discrete variables:

plot_histogram(Housing)

plot_bar(Housing)

From the histograms, we can see that the price and area variables look to be right skewed, which indicates we might need to complete some sort of transformation on them when it comes to the analysis. All other variables appear to be discrete numeric variables.

From the barplots, we can see the distribution of the data for the 5 Y/N variables. We can see a majority of the units are by a mainroad, do not have a guestroom, do not have a basement, do not have hot water heating, and do not have air conditioning.

Fit the Linear Model

Now that we have done a little exploration into the data and have started to draw some insights, we can start with fitting the linear model. We can use the lm() function to fit the model.

  • The first argument this function takes is a formula for the model. Here our formula is response variable as a function of all the explanatory variables. The way we write this in the function is y-variable ~ x-variable1 + x-variable2 +...+ x-variableN.
  • The second argument this function takes that I am using is the data argument. This allows us to specify the dataframe that contains the variables so that we only need to specify the variables in the formula, not both the variables and the dataframe they come from.
  • Some other arguments of interest that we could useful depending on the data are subset, na.action, and method. The subset argument allows us to specify the observations in the dataset that should be used in the data fitting process. The na.action argument indicates what should happen when there are NAs present, such as if we should fail or just omit those observations. The method argument allows us to specify the statistical method to be used to fit the model.
  • There are additional useful arguments that can be used with this function, there are just the immediate ones to call out.

Here we will fit the model using lm(). Since we are interested in all of the characteristics of the housing unit and how they affect the price, the price is our response variable and all other columns will be explanatory variables. Therefore the formula we will pass into the function is price ~ all other columns.

The summary() function is a generic function the shows the high level summary of the fitted model when a model is passed in as the object. We can use that here too to get a sense of the model we fit:

model1<-lm(price~area+bedrooms+bathrooms+stories+mainroad+guestroom+basement+hotwaterheating+airconditioning+parking,data = Housing)
summary(model1)
## 
## Call:
## lm(formula = price ~ area + bedrooms + bathrooms + stories + 
##     mainroad + guestroom + basement + hotwaterheating + airconditioning + 
##     parking, data = Housing)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2964698  -636157   -29595   627644  5701633 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        -413203.57  244400.86  -1.691 0.091482 .  
## area                   270.71      24.89  10.876  < 2e-16 ***
## bedrooms            137214.77   75390.58   1.820 0.069311 .  
## bathrooms           992706.66  107353.99   9.247  < 2e-16 ***
## stories             468964.63   66676.76   7.033 6.20e-12 ***
## mainroadyes         571638.85  146067.33   3.914 0.000103 ***
## guestroomyes        348747.12  136739.63   2.550 0.011036 *  
## basementyes         500993.36  112283.49   4.462 9.91e-06 ***
## hotwaterheatingyes  829053.38  231394.83   3.583 0.000371 ***
## airconditioningyes  900731.69  112129.80   8.033 6.13e-15 ***
## parking             294701.34   60576.45   4.865 1.51e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1111000 on 534 degrees of freedom
## Multiple R-squared:  0.6538, Adjusted R-squared:  0.6473 
## F-statistic: 100.8 on 10 and 534 DF,  p-value: < 2.2e-16

We can see all of the variables are relatively significant and many are highly significant (indicated by the * and . listed to the right of the p value column), but are we violating any assumptions with this model? As a reminder, the assumptions we are looking at with the residuals are that the residuals are normally distributed, there is no pattern to the residuals, and the residuals have the same variance. (Note-There are other assumptions such as an assumption of linear relationships that are assumed for this project.)

To check these assumptions we can pass our model we fit above into the plot() function. I am using the which argument to tell the function I only want the first and second plots, which are the two plots we will use to look at the assumptions mentioned above.

plot(model1,which=1:2)

With the Residuals vs. Fitted plot, we are looking for no pattern in the residuals. However, we see that there is a small curve in the data that indicates there is a sort of pattern. The variance also appears to get larger as you move from the left to the right of the graph, which would violate the assumption of equal variance.

With the Normal QQ plot, we are looking for the data to fall on the line, or very close to it. However, we see here in this plot that the data at the tails deviates far from the line and thus the residuals don’t appear to be normal.

Transform the data

Since we noticed the price variable was right skewed, we can try a log transformation to see if this improves the residual plots. We can do this by using the log() function and replacing price with log(price) in the model. We can then check the plot of the residuals like we did above to check assumptions.

model2<-lm(log(price)~area+bedrooms+bathrooms+stories+mainroad+guestroom+basement+hotwaterheating+airconditioning+parking,data = Housing)
plot(model2,which=1:2)

The Residuals vs. Fitted plot looks a lot better here. There is no longer that curve in the data, and the residuals appear to be random. The spread of the data also appears to be consistent from left to right. I believe it is reasonable to assume the assumptions are met here.

The QQ Plot appears to be better as well. There is still a little departure from the lines at the tails, but overall this looks better. To be sure, we can check the normality of the residuals further by plotting a histogram of the residuals using the hist() function, and can call the residuals from the model by using model2$residuals.

hist(model2$residuals)

The residuals look relatively normal, with possibly a slight tail on the left. However I believe it is reasonable to assume the assumptions are met here.

Since the assumptions appear to be met, we can look at our model and draw some insights. Again, we can use summary(model2) to view information about this model we fit:

summary(model2)
## 
## Call:
## lm(formula = log(price) ~ area + bedrooms + bathrooms + stories + 
##     mainroad + guestroom + basement + hotwaterheating + airconditioning + 
##     parking, data = Housing)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.70458 -0.12884  0.01778  0.14735  0.63933 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        1.425e+01  4.831e-02 294.934  < 2e-16 ***
## area               5.492e-05  4.920e-06  11.163  < 2e-16 ***
## bedrooms           3.578e-02  1.490e-02   2.401 0.016694 *  
## bathrooms          1.659e-01  2.122e-02   7.816 2.92e-14 ***
## stories            9.407e-02  1.318e-02   7.137 3.13e-12 ***
## mainroadyes        1.506e-01  2.887e-02   5.216 2.62e-07 ***
## guestroomyes       8.003e-02  2.703e-02   2.961 0.003205 ** 
## basementyes        1.226e-01  2.220e-02   5.523 5.22e-08 ***
## hotwaterheatingyes 1.649e-01  4.574e-02   3.605 0.000341 ***
## airconditioningyes 1.809e-01  2.217e-02   8.161 2.39e-15 ***
## parking            5.039e-02  1.197e-02   4.208 3.02e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2196 on 534 degrees of freedom
## Multiple R-squared:  0.6583, Adjusted R-squared:  0.6519 
## F-statistic: 102.9 on 10 and 534 DF,  p-value: < 2.2e-16

Here we can see that all of the variables in the model are significant. Many of the variables are even significant at a ~0 level of significance. Since we wouldn’t want to remove any of the variables, this is most likely what our final model would look like.

We have an R-squared value of 0.6519, which indicates that roughly 65% of the variation in the price can be explained byt this model. There is still room for improvement, but with all of our variables being highly significant, I don’t believe we would improve this fit with removing any variables.

Plotting components of the model

Now that we have the final model, I will take the three most significant variables and create a multipanel plot to analyze the area, airconditioning, and bathroom variables, and their affect on the price.

I would like to create one plot for each level of the number of bathrooms, and then fit the model log(price)~area for observations with air conditioning and observations without air conditioning and display the best fit regression line. So there will be a different plot for each level of bathrooms, and each plot will have two lines of best fit for the levels of air conditioning.

To find out values of bathroom we have in our data, we can use the tables() function to summarize the data in the bathrooms variable and list how many rows we have at each level:

table(Housing$bathrooms)
## 
##   1   2   3   4 
## 401 133  10   1

Here we can see there are 401 rows with 1 bathroom, 133 rows with 2 bathrooms, 10 rows with 3 bathrooms, and 1 row with 4 bathrooms. For purposes of our analysis we will remove the one row with 4 bathrooms since we wont be able to fit a reliable model for that level with the lack of data. Note- we may also not be able to have a reliable plot for the 3 bathroom level with only 10 data points, but for purposes of this analysis I will continue to fit it and can draw conclusions at the end.

We can use the which() function to find what row is the unit with 4 bathrooms. Once we have the row number, we can use this in conjunction with the indexing brackets to remove it from the dataset. Then we can rerun the tables function to ensure this row was dropped.

UpdHousing<-Housing[-which(Housing$bathrooms==4),]
table(UpdHousing$bathrooms)
## 
##   1   2   3 
## 401 133  10

The next step needed is to segment the data based on the number of bathrooms. This sub-setting will create three different datasets, where the first dataset includes all the rows from the original dataset with 1 bathroom, the second dataset includes all rows from the original dataset with 2 bathrooms, and the third dataset includes all the rows from the original dataset with 3 bathrooms. This step is done since I plan to refit and plot each subset of data, so having separate sets will make it cleaner.

We can use the subset function to complete this segmenting of the data. All we need to do is pass our datatable as the first argument, and our logical expression indicating what rows we want to keep as the second argument. We then assign that to a new object, and we have our subsetted data:

UpdHousing1<-subset(UpdHousing,UpdHousing$bathrooms==1)
UpdHousing2<-subset(UpdHousing,UpdHousing$bathrooms==2)
UpdHousing3<-subset(UpdHousing,UpdHousing$bathrooms==3)

Now that we have our subsetted data, we can fit a model for log(price)~area for each of our bathroom datasets.

In this lm(), I used the subset argument to even further segment the data. This argument takes an expression to specify what observations should be used in the fitting process. If we take the first lm() in the code chunk below for example, the subset argument is subset = airconditioning==“no”. Therefore for this model, we only consider observations where the airconditioning variable is set to no.

This extra layer of subsetting is done for the reason mentioned above- to have two different regression lines, one for the observations with air conditioning, and one for observations without air conditioning.

noairmodel1<-lm(log(price)~area,data = UpdHousing1,subset = airconditioning=="no")
airmodel1<-lm(log(price)~area,data = UpdHousing1,subset = airconditioning=="yes")

noairmodel2<-lm(log(price)~area,data = UpdHousing2,subset = airconditioning=="no")
airmodel2<-lm(log(price)~area,data = UpdHousing2,subset = airconditioning=="yes")

noairmodel3<-lm(log(price)~area,data = UpdHousing3,subset = airconditioning=="no")
airmodel3<-lm(log(price)~area,data = UpdHousing3,subset = airconditioning=="yes")

At this point, we have six different models, one for each combination of bathrooms (1,2,3) and air conditioning (Y,N). In all of the models, we are looking at log(price)~area.

We are now ready to plot the data. There are many functions in the next code chunk that will be broken down here before the plot code executes.

  • Layout()
    • The layout function provides a flexible way to split the plotting window when you would like to display multiple plots. The arguments of this function help dictate how to split up the plotting window, and how the plots should be displayed
    • The way I use layout below is to segment the plotting window into 4 rows, where row 1, 2, and 3 will all contain a plot, and the final row will contain space for an axis and text. I use the heights argument to specify the last row should be 1/2 the size of the other rows.
  • Par()
    • The par function is similar to the layout function in that it is used to set graphical parameters that control the plot output. There are numerous arguments that can be used here to do things such as set the background color(bg), specify what type of box should be drawn around the plot (bty), pick which font to use for text (font), set the margin size (mar), and the plot dimensions (pin). More arguments can be found in the help documentation.
    • I use par() to set the margins for my plot so that I can fit text around it, and to also set the type of plot lines we should draw to an L shape.
  • With()
    • The with function takes on the form with(dataframe, something you want to do). This function work similarly to how the data= argument works inside other functions, by specifying the dataframe you are working with. Therefore, if you need to call or reference a variable, you do not need to specify both the dataframe and variable, you only need to reference the variable.
    • I use the with() function below because the plot function doesn’t have a data argument to specify the dataframe I’m using. Therefore I use with() for readability since I don’t need to specify the dataframe I’m calling the variables from for each variable I use.
  • plot()
    • Plot() is a generic function that allows for plotting objects in R. When continuous x and y objects are passed into plot, a scatterplot is created. Many of the arguments you can you specify formatting of the plot, such as labels, colors, and point types.
    • I use this function to plot area vs. log(price)
  • abline()
    • This function adds a straight line to the current plot. You will pass in the intercept (a argument) and the slope (b argument), and then can also specify horizontal or vertical lines.
    • I use this function several times to add the best fit line to the graphs.
  • text()
    • The text() function allows us to add text to a plot. This function’s main arguments are those to specify where the text should go, as well as what the text should be.
    • I use text() here to add information about which plot is which when it comes to the number of bathrooms.
  • mtext()
    • Mtext() is very similar to text(), the main difference is that mtext() is used when you want to add text to the margins of a plot. The arguments are also similar in that the main arguments are those to specify where the text should go, as well as what the text should be.
    • I use Mtext() to specify the axis titles.
  • legend()
    • The legend function allows us to add legends to the plot. There are numerous arguments that allow for a ton of flexibility, such as specifying colors, line types, point types, legend text, and positioning of the legend. Reference the help file for all the arguments you can use with this function.
    • I use the legends argument twice, once to specify the point types, and once to specify the line types. I also use the positioning arguments to place these legends next to each other.
  • polygon()
    • The polygon function draw a polygon on the plot. You specify the x and y coordinates of where the vertices of the polygon should be, and the function will draw it on the graph. There are also other arguments for things like shading, color of the polygon, and type of border of the polygon
    • I use this function here to draw a box around both of my legends to make them look cohesive.
#We first need to set up the plotting space and make sure the layout is correct
layout(matrix(c(1, 2, 3, 0), nrow = 4), heights = c(1, 1, 1, 0.5))
par(mar = c(0.5, 5, 0.5, 1), bty = "l")

#This line sets the colors that will be used for the points
col=c("Burlywood3","Darkseagreen4")  
#This line sets the point types that will be used in the plot
points=c(21,22)

#This first chunk plots a scatterplot of area vs. log(price) for units with 1 bathroom. If observation has air conditioning or not is specified by the color and point type. Lines are added for the model without air conditioning and with air conditioning and text is used to specify what the plot represents. 
with(UpdHousing1,plot(x=area,y=log(price),col=col[airconditioning],pch=points[airconditioning],xaxt = "n",xlab = "",xlim = c(2000,16000), ylab = ""))
abline(noairmodel1,col="Burlywood3")
abline(airmodel1,col="Darkseagreen4")
text(x=2000,y=16, "One Bathroom", pos = 4, cex = 1.2)

#This second chunk plots a scatterplot of area vs. log(price) for units with 2 bathrooms. If the observation has air conditioning or not is specified by the color and point type. Lines are added for the model without air conditioning and with air conditioning and text is used to specify what the plot represents. 
with(UpdHousing2,plot(x=area,y=log(price),col=col[airconditioning],pch=points[airconditioning],xaxt = "n",xlab = "",xlim = c(2000,16000), ylab = ""))
abline(noairmodel2,col="Burlywood3")
abline(airmodel2,col="Darkseagreen4")
text(x=2000,y=16.25, "Two Bathrooms", pos = 4, cex = 1.2)

#Text is added for a y-axis label at this point so that it is in the middle of the plot
mtext(side = 2, line = 3, text = "Log(Price) in Dollars", cex = 1.5)

#Legends are added and boxed in here so that it is included in the second plot
legend(x=10500,y=15.2,legend = c("Without Air Conditioning","With Air Conditioning"),col=col,pch=points,bty="n")
legend("bottomright",legend = c("Best Fit Line Without Air Conditioning","Best Fit Line With Air Conditioning"),col=col,lty=c(1,1),bty="n")
polygon(x=c(10000,16500,16500,10000),y=c(14.65,14.65,15.15,15.15))

#This third chunk plots a scatterplot of area vs. log(price) for units with 3 bathrooms. If the observation has air conditioning or not is specified by the color and point type. Lines are added for the model without air conditioning and with air conditioning and text is used to specify what the plot represents. 
with(UpdHousing3,plot(x=area,y=log(price),col=col[airconditioning],pch=points[airconditioning],xlab="",xlim = c(2000,16000),ylab=""))
abline(noairmodel3,col="Burlywood3")
abline(airmodel3,col="Darkseagreen4")
text(x=2000,y=16, "Three Bathrooms", pos = 4, cex = 1.2)

#This text adds a label for the x-axis
mtext(side = 1, line = 3, text = "Area in Square Feet", cex = 1.25)

Conclusion

As we can see from the plots, in general as the area of the housing unit increases, the price also increases regardless of if there is or is not air conditioning, an no matter the number of bathrooms. There is one case where this pattern is not true, for the three bathroom plot we see a negative slope for units with air conditioning. This would indicate as area increases for three bathroom units with AC, price decreases. I believe that this is just due to only having 2 datapoints here, and having more data would change this slope to model what we see in the other plots.

We can also see that in general, having AC increases the price of the housing units. For the 1 and 2 bathroom plots, the regression line for the units with AC is larger than the regression line for units without AC. This would make sense since many people would most likely pay slightly more to have an air conditioned housing unit.

References

Data Source: https://www.kaggle.com/code/ashydv/housing-price-prediction-linear-regression/data

Other Relevant Information:

DataExplorer Package: https://cran.r-project.org/web/packages/DataExplorer/DataExplorer.pdf