All of the code for this project can be found on my GitHub
Accessing housing market data can be surprisingly difficult. MLS listings are restricted to agents and brokers, county websites only list homes after they have sold, and there is no good way to access the data in real time. While platforms like Zillow and Redfin offer insights into real estate trends, they often restrict direct downloads, making large-scale analysis challenging. To work around this limitation, I built a web scraper using Python's requests and BeautifulSoup libraries to collect data on recently sold homes. This refreshed my web scraping skills and got me thinking critically about feature selection for future regression analysis.
Redfin allows users to filter by location, sale status, and time frame, but there's no easy way to export that data. Instead, I programmatically navigated Redfin's search feature, capturing the HTML from each results page using requests. Once I had the raw HTML, the next step was to identify and extract the meaningful information. Using browser developer tools, I inspected Redfin's pages and looked for places where useful data appeared in a consistent, repeated structure.
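The page-fetching loop looked roughly like the sketch below. The zip code, page count, and URL pattern are placeholders (the real search URL comes from setting up the filters in Redfin's UI), and a browser-like User-Agent header is generally needed to avoid being blocked:

```python
import requests

# Redfin tends to block default Python user agents, so mimic a browser.
HEADERS = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}

# Illustrative URL — the real one comes from Redfin's filtered search page.
BASE_URL = "https://www.redfin.com/zipcode/12345/filter/include=sold-6mo"

pages = []
for page_num in range(1, 9):  # hypothetical page count
    url = BASE_URL if page_num == 1 else f"{BASE_URL}/page-{page_num}"
    response = requests.get(url, headers=HEADERS, timeout=30)
    response.raise_for_status()
    pages.append(response.text)  # raw HTML to parse with BeautifulSoup later
```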
I found around 30 fields that appeared fairly consistently across listings, so I used BeautifulSoup to extract them into a DataFrame covering every home in my zip code. I then saved the DataFrame to a CSV for further cleaning.
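The extraction was essentially a loop over listing cards. The selectors and field names below are stand-ins (the real ones came from the dev-tools inspection described above):

```python
from bs4 import BeautifulSoup
import pandas as pd

def extract_field(card, selector):
    """Return the text of the first element matching a CSS selector, or None."""
    element = card.select_one(selector)
    return element.get_text(strip=True) if element else None

records = []
for html in pages:
    soup = BeautifulSoup(html, "html.parser")
    # Selector names are illustrative — the real ones come from dev tools.
    for card in soup.select("div.HomeCardContainer"):
        records.append({
            "price": extract_field(card, ".homecardV2Price"),
            "address": extract_field(card, ".homeAddressV2"),
            "beds": extract_field(card, ".HomeStatsV2 .stats"),
            # ...and so on for the rest of the ~30 fields
        })

df = pd.DataFrame(records)
df.to_csv("sold_homes_raw.csv", index=False)
```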
Raw web-scraped data often contains inconsistencies, missing values, and formatting issues that must be addressed. For this dataset, I had to handle missing values, validate inputs, and assess overall data quality.
One of the first issues I encountered was inconsistent formatting in numeric fields like price and square footage. Since these values were extracted as strings, I converted them to floats for easier manipulation and analysis (a sketch of the conversion follows the list below). This involved:
Removing currency symbols and commas from price fields ($350,000 → 350000.0)
Converting square footage to numerical values (1,500 sqft → 1500.0)
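The conversion boiled down to a small helper like this (column names are assumptions about how I labeled the scraped fields):

```python
import re
import pandas as pd

def to_float(value):
    """Strip currency symbols, commas, and unit labels, then cast to float."""
    if pd.isna(value):
        return None
    cleaned = re.sub(r"[^0-9.]", "", str(value))  # keep only digits and "."
    return float(cleaned) if cleaned else None

df["price"] = df["price"].apply(to_float)  # "$350,000"   -> 350000.0
df["sqft"] = df["sqft"].apply(to_float)    # "1,500 sqft" -> 1500.0
```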
Not all listings contained complete information, but I noticed that legitimate home listings consistently included:
Price
Number of Bedrooms
Number of Bathrooms
Year Built
Square Footage
Any row missing these critical details was likely incomplete or a non-standard listing (e.g., land sales, auctions, or commercial properties). To maintain dataset integrity, I removed any record that lacked one or more of these fields.
The dataset included a column for home style, which categorized properties as Single Family Residential, Townhome, Condo, or other types. Since I had already filtered my Redfin search to only those first three categories, any additional home types appearing in the data were likely errors. To maintain consistency, I removed records that did not belong to one of these categories; both this filter and the missing-field filter above are sketched below.
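Both filters amount to a couple of lines of pandas (column names and category labels here are illustrative):

```python
# Drop listings missing any of the critical fields.
critical_fields = ["price", "beds", "baths", "year_built", "sqft"]
df = df.dropna(subset=critical_fields)

# Keep only the three home styles the Redfin search was filtered to.
allowed_styles = ["Single Family Residential", "Townhome", "Condo"]
df = df[df["home_style"].isin(allowed_styles)]
```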
To better understand the structure of my dataset, I printed summary statistics for each column.
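Something like the following, with describe() covering numeric distributions plus quick counts of missing and unique values:

```python
print(df.describe(include="all"))  # distributions, plus counts for categoricals
print(df.isna().sum())             # missing values per column
print(df.nunique())                # unique values per column, e.g. year_built
```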
This helped me assess how messy the data was and start thinking about potential feature engineering opportunities. For example, seeing a high number of unique values in the "Year Built" column might suggest grouping homes into age ranges, while missing values in certain non-critical fields (e.g. parking total) might inform how I handle imputation.
Finally, I saved the cleaned data to a new CSV.
Before fitting the regression model, I plotted the distribution of the numerical variables to find any outliers. Price and year built each had a handful of outliers that I thought might distort the regression. Since I wasn't interested in purchasing a historic home or an extremely expensive one, I removed these outliers to improve model accuracy.
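This was a simple histogram-then-threshold pass; the cutoffs below are illustrative stand-ins for the ones I eyeballed from the plots:

```python
import matplotlib.pyplot as plt

df["price"].hist(bins=50)
plt.show()
df["year_built"].hist(bins=50)
plt.show()

# Cutoffs are illustrative — the real ones came from eyeballing the histograms.
df = df[(df["price"] < 2_000_000) & (df["year_built"] > 1940)]
```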
I also plotted each independent variable against price to confirm a linear relationship.
At first glance, the relationships looked reasonably linear. However, after reading other blogs about predicting price, I learned that it is fairly common to perform a log transformation on price.
I decided to perform the linear regression using both price and log-transformed price. The transformation produced a somewhat better linear relationship (adjusted R² of 0.83 for log price vs. 0.80 for price). However, this is something I would want to test for subsequent zip codes.
independent variables vs price
independent variables vs log price
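A sketch of how panels like those above can be generated, assuming matplotlib and a hypothetical subset of the feature names:

```python
import numpy as np
import matplotlib.pyplot as plt

df["log_price"] = np.log(df["price"])

features = ["beds", "baths", "sqft", "year_built"]  # illustrative subset
fig, axes = plt.subplots(2, len(features), figsize=(16, 8))
for i, col in enumerate(features):
    axes[0, i].scatter(df[col], df["price"], s=5)
    axes[0, i].set_title(f"{col} vs price")
    axes[1, i].scatter(df[col], df["log_price"], s=5)
    axes[1, i].set_title(f"{col} vs log price")
plt.tight_layout()
plt.show()
```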
After creating dummy variables, I used the Variance Inflation Factor (VIF) to determine whether the independent variables were correlated with one another. Unfortunately for this dataset, the variables are highly correlated. Even after removing multiple variables, including "baths", "senior community", and "new construction", due to high VIF, the remaining VIFs stayed extremely high. This makes sense: larger homes tend to have more bedrooms, and newer homes also tend to be larger.
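The VIF check itself is a few lines with statsmodels (a VIF above ~10 is the usual red flag):

```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

# Dummy-encode categoricals; drop_first avoids the dummy-variable trap.
X = pd.get_dummies(df.drop(columns=["price", "log_price"]), drop_first=True)
X = add_constant(X).astype(float)  # constant term needed for meaningful VIFs

vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
print(vif.drop("const").sort_values(ascending=False))
```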
To address some of the multicollinearity, I removed the "baths" feature, since that information is largely captured by beds and sqft already. Other features were removed because they turned out to be insignificant after fitting the regression.
Now we get to the fun part! I split the data into train and test sets and created my first linear regression using sklearn. I was actually quite happy with my initial adjusted R² of 0.83 on the test set, although the train set scored 0.92, so I knew there was a fair bit of overfitting. This is almost certainly due in part to the multicollinearity.
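Since sklearn only reports plain R², adjusted R² has to be computed by hand. A sketch of the fit, with log price as the target:

```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

y = df["log_price"]          # swap in df["price"] for the untransformed run
X = X.drop(columns="const")  # VIF needed the constant; sklearn adds its own intercept

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression().fit(X_train, y_train)

def adjusted_r2(r2, n, p):
    """Penalize R² for the number of predictors p, given n observations."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(adjusted_r2(model.score(X_train, y_train), len(y_train), X_train.shape[1]))
print(adjusted_r2(model.score(X_test, y_test), len(y_test), X_test.shape[1]))
```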
After reviewing the p-values for each of the features, I removed view_yn, senior_community_yn, and new_construction_yn. This had some impact on adjusted R², but plain R² remained similar. The good news is that the residual plots showed no obvious pattern and the residuals were roughly normally distributed.
Residuals plot
Residuals PDF
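The diagnostics above are quick to produce (a sketch assuming matplotlib and seaborn):

```python
import matplotlib.pyplot as plt
import seaborn as sns

residuals = y_test - model.predict(X_test)

# Residuals vs fitted: we want a patternless horizontal band around zero.
plt.scatter(model.predict(X_test), residuals, s=5)
plt.axhline(0, color="red")
plt.xlabel("Predicted log price")
plt.ylabel("Residual")
plt.show()

# Residual distribution: a roughly bell-shaped curve suggests normal errors.
sns.histplot(residuals, kde=True)
plt.show()
```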
I read a few other notebooks on Kaggle to see how other folks had reduced multicollinearity in their datasets. One common method was ridge regression, so I tried that next. Ridge regression differs from ordinary least squares by adding a penalty term that punishes large coefficients. The end result is that coefficients are shrunk toward 0, which reduces the coefficient variance that multicollinearity inflates.
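In sklearn this is a near drop-in replacement for LinearRegression; the alpha grid below is a guess, not necessarily what's in the notebook:

```python
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Ridge minimizes ||y - Xb||² + alpha * ||b||²; alpha sets the shrinkage.
search = GridSearchCV(Ridge(), {"alpha": [0.01, 0.1, 1, 10, 100]}, cv=5)
search.fit(X_train, y_train)

model = search.best_estimator_
print(search.best_params_, model.score(X_test, y_test))
```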
In fact, the ridge regression (v4 notebook if you're following along on GitHub) yielded an adjusted R² of 0.914 for the train dataset and 0.946 for test! 🎉 The best result came from combining ridge regression with log-transformed price.
I created a dataframe holding the price predictions for all homes, including the % difference from list price and a confidence interval for the predicted price, and saved it to a CSV. To my disappointment, the confidence intervals are quite large, probably because the standard deviation of home prices is high, so my original intention of building an alerting system for underpriced listings is probably not worth the squeeze. Not to mention that list prices are a strategy agents often use to influence sale price (e.g., starting bidding wars).
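Roughly like this, with the interval derived from the residual spread on the log scale (a simplification of what the notebook does):

```python
import numpy as np

log_pred = model.predict(X)
resid_std = np.std(y_train - model.predict(X_train))

predictions = pd.DataFrame({
    "list_price": df["price"],
    "predicted_price": np.exp(log_pred),           # undo the log transform
    "ci_low": np.exp(log_pred - 1.96 * resid_std),  # rough 95% interval
    "ci_high": np.exp(log_pred + 1.96 * resid_std),
})
predictions["pct_diff"] = (
    (predictions["list_price"] - predictions["predicted_price"])
    / predictions["predicted_price"] * 100
)
predictions.to_csv("price_predictions.csv", index=False)
```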
I also decided to train and test the best model on a neighboring zip code to see how portable it might be. Unfortunately, adjusted R² suffered, indicating that my optimizations were too specific to the zip code I first tested on. This prompted me to do some reading on how Zillow and Redfin have built their price estimators. It seems that they actually model the behavior of most listing agents: collect data on comps and determine pricing based on how the home compares to recently sold homes of a similar class. Maybe multiple linear regression is not the best tool for this job...
One of my biggest takeaways from this project was that home pricing datasets are difficult:
Data collection for non-agents is difficult, and even MLS data is relatively unstructured
There is a lot of multicollinearity
There are many variables that are not accurately captured, including recent remodels, the condition of the home, whether it is on a noisy street, and whether it has a view.
For these reasons, I don't think that multiple linear regression is actually the best way to answer this question. I'm looking forward to learning additional data science techniques and coming back to this problem with more sophistication, but for now it was a great way to brush up on web scraping, data cleaning, and linear regression.
If I revisit this problem in the future, I would consider including school district as a feature, as well as experimenting with image recognition on listing photos to incorporate some measure of the home's quality and condition. If you have any ideas about how to improve my analysis or additional techniques to try, send your feedback my way, and thank you for reading!