How do you find the best value rental home?
- Leo Liu
- Oct 12, 2018
- 2 min read
Updated: Dec 12, 2018
Objective
The objective of this project is to help customers who want to rent a home find the best-value property and budget their rental expenses.
Modeling Process
To achieve this goal, I built a linear regression model on a dataset gathered from www.realtor.com to predict rental prices from the following features: number of bedrooms, number of bathrooms, total square feet, year built, number of rated schools nearby, average school rating, and the median rental price and median listing (selling) price in the neighborhood.
In the initial phase of modeling, I screened features by examining a heatmap of the correlation matrix (see Figure 1) for multicollinearity and by checking the p-values of a simple OLS (Ordinary Least Squares) model to determine which features did not contribute to the response (rental price). As a result, Average School Rating and Median Listing Price in the neighborhood were removed from the model because of their high p-values, which indicate that these features are unlikely to affect the response. Removing Median Listing Price also addressed the multicollinearity problem: Median Listing Price and Median Rental Price were highly correlated, so keeping both gives little information about the response beyond what either one provides alone.

Next, adding degree-2 polynomial features and a StandardScaler to the original OLS model improved the R-squared from 0.55 to 0.79. However, a train-test split raised a red flag for overfitting. To investigate, I inspected the model's residual plot and Normal Q-Q plot (see Figure 2) to check for heteroskedasticity. The plots suggest that the predictions and some of the features are probably not normally distributed.
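A polynomial model with scaling and a train-test check might look like the sketch below. It uses scikit-learn's `Pipeline` on synthetic data; the data, the split ratio, and the random seeds are assumptions for illustration, and the step order (polynomial expansion, then scaling) is one common choice rather than necessarily the one used in the project.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(42)

# Synthetic data with an interaction term and a squared term.
X = rng.uniform(0, 1, size=(400, 4))
y = (2 * X[:, 0] + 3 * X[:, 1] * X[:, 2] + X[:, 3] ** 2
     + rng.normal(0, 0.1, 400))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Expand to degree-2 terms, standardize them, then fit ordinary least squares.
model = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    StandardScaler(),
    LinearRegression(),
)
model.fit(X_train, y_train)

# A large gap between these two scores is the usual sign of overfitting.
r2_train = model.score(X_train, y_train)
r2_test = model.score(X_test, y_test)
print(f"train R^2 = {r2_train:.3f}, test R^2 = {r2_test:.3f}")
```

Fitting the scaler inside the pipeline means it only sees the training data, so the test score is not contaminated by statistics from the test set.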

After examining histograms of all features and the target (rental price), I saw that rental price and Square Feet are not normally distributed. After transforming these variables, the residual plot and Normal Q-Q plot look much better (see Figure 3).
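The post does not say which transformation was applied. A natural-log transform is a common choice for right-skewed, strictly positive quantities such as prices and areas, so the sketch below assumes it. The data is synthetic and the variable names are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Synthetic right-skewed "rent" values (log-normal by construction).
rent = pd.Series(rng.lognormal(mean=7.5, sigma=0.5, size=1000), name="rent")

# Skewness near 0 is consistent with a symmetric, roughly normal shape.
skew_before = rent.skew()

# Assumed transformation: natural log of the variable.
log_rent = np.log(rent)
skew_after = log_rent.skew()

# Predictions made on the log scale are mapped back with the inverse (exp).
recovered = np.exp(log_rent)

print(f"skew before: {skew_before:.2f}, after log: {skew_after:.2f}")
```

The same check can be repeated for Square Feet, and the residual and Q-Q plots redrawn after refitting on the transformed columns.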

Moreover, the overfitting problem disappeared after the transformation. The R-squared values for the training and test sets are 0.791 and 0.780, and the mean squared errors are 0.031 and 0.037, which implies a 1.6% deviation between train and test after transforming back to the original scale. The fit is shown in Figure 4.
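The post does not show how the 1.6% figure was derived. One reading that reproduces it, assuming the target was modeled on a natural-log scale, is to take the root mean squared error of each set and exponentiate the difference, which turns an additive gap in log space into a multiplicative gap in the original units:

```python
import math

# Mean squared errors reported in the text (on the transformed scale).
mse_train = 0.031
mse_test = 0.037

# Root mean squared errors, still on the (assumed) natural-log scale.
rmse_train = math.sqrt(mse_train)
rmse_test = math.sqrt(mse_test)

# A difference d in log space corresponds to a ratio exp(d) in original units.
ratio = math.exp(rmse_test - rmse_train)
deviation_pct = (ratio - 1) * 100

print(f"RMSE train = {rmse_train:.3f}, RMSE test = {rmse_test:.3f}")
print(f"relative deviation = {deviation_pct:.1f}%")
```

If a different transformation was used, the back-conversion would differ and this arithmetic would not apply.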

Tools
Data Collection: BeautifulSoup, Selenium
Data Analysis: statsmodels, sklearn, pandas
Presentation: Matplotlib, Seaborn
Future Work
First, I would collect more data and include more condos and single-family homes to make the model more robust and generalizable. In addition, separate models might be built for different types of rental homes. Incorporating more features into the model is another task I will continue to work on.