Tutorial: How to construct a hedonic home price index.
Questions? E-mail us at info@globaldecision.com
Our goal is to provide a useful and clear tutorial. We value all constructive feedback and work to enhance the site based on user reader feedback.
Step 1: Grab some real estate data
Grab some real estate data. Redfin.com is a good source, as you can select actual sales data for a group of up to 500 homes. If this is your first hedonic model, we recommend selecting SFR's only. They tend to be more homogenous with respect to how their valuation relates to the underlying factors. In addition, we recommend using a geographic area that's as homogeneous as possible. Combining, for example, all of Costa Mesa, CA ZIP code 92627 will require an additional variable in the model. Because the "West Side" of Costa Mesa has substantially lower property values vs. the "East Side," a good hedonic analysis would need to specifically add a 1/0 factor to identify whether the property is "West Side" or "East Side." FYI: if you're joining us from outside of Orange county, CA, the East and West sides of Costa Mesa are split by CA-55 / Newport Ave.
Step 2: Clean up your dataset
Clean up your dataset. It's OK to remove a few percent of the data. Find any properties that are obviously out of line with reality. A good approach might be to filter out clearly mispriced homes. In the Irvine, CA area for the year 2011, home sales under $300,000 are clearly data errors, fraud, or unusual transactions. Homes sales over $2,000,000, while rare, are not representative of the market and also excluded. Homes with zero baths or zero beds are also good candidates for exclusion.
Step 3: Convert the date into regression-friendly format
Convert the date into a categorical element with discrete time values. A sales date is a core measure, but it's hard to work with because it can take on too many values (365 or 366 in just one year). For hedonic purposes, we recommend either using just the year component of the date (for longer term analysis) or a year-quarter combo (such as 2011-Q1). In Excel, it's simple to extract a year or month from a date using =YEAR() or =MONTH() functions. To convert a month into a quarter, build a simple 12 row table and use VLOOKUP(source, quarter_table, 2, FALSE) where source is the cell with the month, and where quarter_table simply converts 1-3 into Q1, 4-6 into Q2, and so forth.
Good regression analysis software will automatically handle a categorical variable (which you just created) by making X-1 dummy variables where X is the number of possible values for that categorical variable. For example, if your date range is from 1/1/2009 to 6/30/2011, and your category represents a calendar quarter, you'll have 10 possible values ("2009-Q1" to "2011-Q2"). The underlying regression will often use the first possible value "2009-Q1" as the baseline price index. The regression will create 9 dummy variables for "2009-Q2" to "2011-Q2". The value of these 9 variables will be the amount by which the baseline home value has changed from one period to the next. If you prefer to have another quarter as your baseline, most software will allow that. It's called changing the "reference level" and the
relevel() function will do the job in R.
Step 4: Execute the regression
Execute the regression. Primary drivers of home values are obvious: bedrooms, bathrooms, square feet, location, and garage spots. Put as many of these as you can get a hold of into the hedonic regression model. Once you have a basic model with these variables, you can try adding others to see if the model fit improves. To execute the regression, you'll need to make sure that you have no missing values for any data items. For now, just delete rows where the values are missing. There are whole dissertations on how to handle missing values in regressions. If this requires the deletion of too many rows, you'll have to forego the inclusion of the variable with many gaps. The core data should be saved in a .csv/text file with a header row and first few rows that look like:
saleprice,sqft,beds,baths,garages,quarter
249000,1200,2,1,2,"2009-Q1"
445000,1800,3,2,2,"2009-Q1"
430000,1900,4,2,2,"2009-Q1"
565100,2100,4,3,2,"2009-Q1"
You can execute the following commands to create a regression in R:
newdata <- read.csv ("yourfile.csv", header=TRUE, sep=",")
str(newdata)
The str() command will show you the structure of the data frame (R-speak for table) and you can verify that the data was successfully imported into R.
attach(newdata)
newdata.fit <- lm (salesprice ~ sqft + beds + baths + garages + quarter)
summary(newdata.fit)
The summary() command in R will provide you with the coefficients for the regression that best represents the relationship between the sales price and the explanatory variables (sqft, beds, baths, garages, quarter). In addition, R will provide an "R-squared" value as a measure of fit. In short, R-squared ranges between 0 and 1 and represents the percent of the variation in sales price that's explained by the model. Anything upwards of 0.9 is starting to be useful for a quick and easy hedonic. Even models with "R-squared" of less than 0.9 can contain useful information in their coefficients and are not without merit.
Step 5: Interpret the results
Convert the change in home values, given by dummy variables in Step 4 into actual prices. A quick-and-dirty way to do this is to simply find the median home sale data in your data from the first time period ("2009-Q1" in our case). Plot the actual median home value from the underlying data against the hedonic model's median and look for differences in the two. Such differences are caused by cases where an underlying factor in the hedonic model (but not in the median) has varied. Since Redfin's data source has beds, baths, sqft, and garage size it should generate a decent model for your area.
By popular demand, weíll be updating this page with actual results and more screen shots soon (noted 6/29/11).
Back to the Real Estate main page