The golden rule in real estate investing is, “You make your money when you buy, not when you sell.” This means that your purchase price is the main factor that determines your profit later on. To quantitatively identify a sound real estate investment, I applied log-linear regression to housing data to develop a deterministic home pricing model. In this example, I look at residential houses in a few housing developments within Sacramento, CA.
To find out which properties are undervalued in a pool of options, we need to know the value of each real estate asset. Real estate value has a slightly different meaning depending on whether you ask a homeowner, homebuyer, appraiser or tax assessor. However, online valuations from the three main players, Zillow’s Zestimate, Redfin’s Estimate and Realtor’s Estimate, can be useful in providing an estimate of how much a given house is worth.
The table below summarizes the home value estimates from all three websites for a selected property within the development. The median error rate for off-market properties in Sacramento County is 4.77 percent. That is, half the time the selling price is within that margin of the estimate. Off-market properties are typically more challenging to value because there’s usually less detailed information available on them.
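To make the idea of a median error rate concrete, here is a minimal Python sketch. The estimates and sale prices are hypothetical and chosen only for illustration, and measuring each error relative to the sale price is an assumption about how the websites define the metric.

```python
from statistics import median


def median_abs_pct_error(estimates, sale_prices):
    """Median of |estimate - sale price| / sale price, in percent."""
    errors = [abs(e - s) / s * 100 for e, s in zip(estimates, sale_prices)]
    return median(errors)


# Hypothetical online estimates and eventual sale prices (illustration only).
estimates = [400_000, 510_000, 295_000, 620_000, 450_000]
sale_prices = [410_000, 500_000, 300_000, 600_000, 450_000]

error_rate = median_abs_pct_error(estimates, sale_prices)
print(f"Median error rate: {error_rate:.2f}%")
```

By construction, half of the homes in the list have an error at or below the value returned, which is what "within that margin half the time" means.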
I decided to use Realtor’s estimates for several reasons. First, Realtor’s estimates appear to be generated from the most recent median home value estimate from AVMs (automated valuation models) provided by three different companies: Collateral Analytics, CoreLogic and Quantarium. These independent modeling techniques are, in turn, generally used by top lenders and insurance companies to make mortgage and property coverage decisions.
Second, although Realtor does not directly provide model error estimates, the Realtor valuations for the few Sacramento properties that were analyzed are more closely in line with Redfin’s estimates than with Zillow’s. Since we can see that Redfin has more accurate estimates than Zillow for the county/city being analyzed, we can infer a comparatively higher level of estimate accuracy for Realtor. In addition, Realtor aggregates nearby home values, which allows one to do a comparative valuation analysis of similar homes that a potential homebuyer would consider. The tool also aggregates key variables, which include the number of bedrooms, number of bathrooms, square footage and lot size.
I used a free online OCR service to convert a screengrab of Realtor’s “Nearby Home Values” (NHV) feature to a usable Excel format. One screengrab would typically include 21 comparable homes. For a more robust model, additional explanatory variables can be included. For example, I was able to manually add the age of each house based on the year built collected from Realtor. In addition, the sample size can be easily expanded using Realtor’s “Similar Homes For Sale” feature, which lets you add homes that are currently for sale to the data and capture the NHV for each of those homes. To ensure data accuracy, I found that the bathroom data in particular needed to be cleaned up, as those values are usually rounded up. It’s also important to keep certain elements constant in the sample being assessed: the neighborhood should be similar, and the driving distance between the houses in the sample should be relatively short. This ensures that known explanatory variables such as school districts, crime rates and convenient access to amenities are held constant.
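The cleanup described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the field names (`estimate`, `beds`, `baths`, `sqft`, `lot`), the sample row, the reference year and the `baths_override` parameter are all hypothetical, since the actual spreadsheet layout is not shown.

```python
import re


def parse_number(text):
    """Pull the first number out of an OCR'd cell such as '$450,000' or '2.5 bath'."""
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text)
    if match is None:
        raise ValueError(f"no number found in {text!r}")
    return float(match.group().replace(",", ""))


def build_record(raw, year_built, reference_year, baths_override=None):
    """Turn one OCR'd row into numeric model inputs.

    baths_override lets a manually verified bathroom count replace the
    rounded-up value from the screengrab (hypothetical parameter).
    """
    baths = parse_number(raw["baths"]) if baths_override is None else baths_override
    return {
        "price": parse_number(raw["estimate"]),
        "beds": parse_number(raw["beds"]),
        "baths": baths,
        "sqft": parse_number(raw["sqft"]),
        "lot": parse_number(raw["lot"]),
        "age": reference_year - year_built,  # age derived from year built
    }


# Hypothetical row, for illustration only.
raw_row = {
    "estimate": "$450,000",
    "beds": "3 bed",
    "baths": "3 bath",
    "sqft": "1,850 sqft",
    "lot": "5,663 sqft lot",
}
record = build_record(raw_row, year_built=1995, reference_year=2020, baths_override=2.5)
print(record)
```

Keeping the bathroom correction as an explicit override makes it easy to see which values were adjusted by hand.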
The table and scatter plots below summarize the descriptive statistics of the sample of 69 houses.
The deterministic pricing model is shown below.
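As a rough illustration of what fitting a log-linear pricing model involves, the Python sketch below regresses the natural log of price on a set of features using ordinary least squares. It is a minimal sketch under assumptions, not the exact procedure used here: the estimation method, the two features (square footage in thousands and age in years) and the data are all hypothetical. The prices are synthetic, generated without noise from known coefficients so that the fit can be checked.

```python
import math


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_log_linear(features, prices):
    """OLS of ln(price) on the features plus an intercept (normal equations)."""
    rows = [[1.0] + list(f) for f in features]
    y = [math.log(p) for p in prices]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def predict_price(coefs, feature_row):
    """Exponentiate the fitted log-price to get a price in dollars."""
    log_price = coefs[0] + sum(c * v for c, v in zip(coefs[1:], feature_row))
    return math.exp(log_price)


# Synthetic data: (sqft in thousands, age in years), illustration only.
TRUE_COEFS = [12.0, 0.4, -0.005]
features = [(1.2, 10), (1.5, 25), (1.8, 5), (2.1, 40), (2.4, 15), (1.0, 30)]
prices = [
    math.exp(TRUE_COEFS[0] + TRUE_COEFS[1] * s + TRUE_COEFS[2] * a)
    for s, a in features
]

coefs = fit_log_linear(features, prices)
predicted = predict_price(coefs, (1.6, 20))
print(coefs, round(predicted))
```

In a model of this form, each coefficient b is read as a proportional effect: a one-unit increase in the variable multiplies the predicted price by exp(b), which is roughly a 100 × b percent change when b is small.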
The model can be applied to on-market properties in the neighborhood to determine which are undervalued or overvalued, as well as to determine which addresses in the sample are undervalued. It’s important to note that an on-market property’s home value estimate on Realtor is equivalent to its listed price. This is not the case with Redfin or Zillow. Thus, I used an aggregated online estimate, which is a weighted average of the three online home value estimates. This is then compared to the model-predicted price.
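The comparison can be sketched as follows. The weights, dollar figures and the 5 percent tolerance band are hypothetical, since the actual weighting is not specified, and treating a property as undervalued when its aggregated estimate sits below the model price is my reading of the comparison.

```python
def aggregate_estimate(estimates, weights):
    """Weighted average of the online estimates, keyed by source name."""
    total_weight = sum(weights[source] for source in estimates)
    weighted_sum = sum(estimates[source] * weights[source] for source in estimates)
    return weighted_sum / total_weight


def classify(model_price, aggregated, tolerance=0.05):
    """Label a property by how far the aggregated estimate is from the model price."""
    gap = (model_price - aggregated) / aggregated
    if gap > tolerance:
        return "undervalued"  # online estimate well below the model price
    if gap < -tolerance:
        return "overvalued"  # online estimate well above the model price
    return "in line"


# Hypothetical estimates and weights, for illustration only.
estimates = {"zillow": 480_000, "redfin": 470_000, "realtor": 450_000}
weights = {"zillow": 0.25, "redfin": 0.5, "realtor": 0.25}

aggregated = aggregate_estimate(estimates, weights)
label = classify(model_price=500_000, aggregated=aggregated)
print(aggregated, label)
```

A tolerance band keeps small differences, which may well be within the estimates' own error rates, from being flagged as mispricing.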
This exercise also highlighted some key issues with the data. The first is that for some home addresses within the in-sample set (In_n), Realtor appears to have two significantly different estimates. For example, while the NHV screengrab indicates that In_3 had a Realtor estimate of $443,000, the direct homepage for the asset indicates an estimated value of $477,700, which is more in line with both Redfin’s and Zillow’s estimates. This skews both the regression model and the indication of undervalued vs. overvalued for each home address. Another thing to note is that the age coefficient is positive, meaning that older houses are predicted to be worth more. I attribute this to poor data selection, where some newer two-bedroom outliers in the same neighborhood with HOA fees are skewing the model output. This results in houses that are significantly older, for example Out_2, having a much higher predicted value than expected.