# Quantifying Natural Beauty

## Using data to understand the economic value of scenery.

There are many factors that go into the price of a home. The job market, the quality of the schools, and the size of the home are all obvious factors that alter the final price. What about some less obvious factors, like the scenery? Are people willing to pay more based on what the view is like out the window, or are they more concerned with the availability of jobs and other more conventional factors? That’s the question I set out to answer.

## My Mission

To be concrete, my goal with this project was to predict the median sale price of existing single-family homes per city (defined by Metropolitan Statistical Area) based on natural features. I also wanted to see if more conventional features, such as population density, would be more predictive.

## Data Sources

For this project, I used an interesting dataset provided by the USDA, which they call the Natural Amenities Scale. This dataset covers every county in the lower 48 states, with measurements designed to capture how pleasant the climate and scenery of each county are. These measurements include summer and winter temperatures, variation in topography, and the amount of water area in the county.

To include non-environmental features, I found an index of urbanization per county. To this, I added the population density per county, which I got from Wikipedia.

For my target feature, the cost of housing, I used information provided by the National Association of Realtors. This dataset gave me the median sale price of existing single-family homes per metropolitan statistical area in 2019.

To apply the county-level data to the city level, I used a dataset that mapped MSA information to the corresponding counties.
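As a rough illustration of that step, here is a minimal sketch of rolling county-level values up to the MSA level with pandas. The column names, FIPS codes, and values are made up for the example, and averaging the counties within each MSA is an assumption on my part here, not necessarily the aggregation the project used.

```python
import pandas as pd

# Hypothetical county-level features (codes and values are illustrative only).
counties = pd.DataFrame({
    "fips": ["01001", "01003", "01005", "01007"],
    "amenity_score": [0.5, 2.0, -1.0, 1.5],
    "pop_density": [90.0, 110.0, 30.0, 50.0],
})

# Hypothetical crosswalk: which MSA each county belongs to.
crosswalk = pd.DataFrame({
    "fips": ["01001", "01003", "01005", "01007"],
    "msa": ["Metro A", "Metro A", "Metro B", "Metro B"],
})

# Attach the MSA label to each county, then average the county values per MSA.
# (Assumption: a simple unweighted mean; a population-weighted mean is another option.)
msa_features = (
    counties.merge(crosswalk, on="fips", how="inner")
    .groupby("msa")[["amenity_score", "pop_density"]]
    .mean()
)
print(msa_features)
```

Keeping the FIPS codes as strings avoids losing leading zeros when the two tables are joined.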

## Project Assumptions

It’s important to be honest about what a model is really saying when we are trying to predict something. With that in mind, there are a few key assumptions that this project made.

First of all, the environmental data is from 1993. This model assumes that the summer and winter temperatures, water area, and relative humidity have not meaningfully changed between 1993 and today.

The updated measure of urbanization is from 2013, and the housing data is from 2019. So we are probably under-representing the intensity of urbanization in this model.

When we add all the assumptions made by a linear regression, we can see that this model is far from providing us some kind of absolute Truth about the world.

## Phase One: Gathering, Cleaning, and Exploring the Data

The most time-consuming part of many projects is getting all the data into one place and into a form that's ready for analysis. That was definitely the case for me.

Working in Python, I used the pandas library to collect all my data, pull out the useful bits, and get it into a form that I could actually use. To get the population density from every county's Wikipedia page, I wrote a function using the BeautifulSoup package that loops through each county's page and extracts the population density for collection and analysis.
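The scraping function itself isn't reproduced here, but the core idea can be sketched as follows. This is a simplified stand-in, not the project's actual code: `extract_density` is a hypothetical helper, and it assumes the density sits in a table row whose header cell contains the word "Density". It parses an inline HTML snippet so that it runs without a network connection.

```python
import re
from bs4 import BeautifulSoup

def extract_density(html):
    """Return the first number in the 'Density' row of a page, or None if absent."""
    soup = BeautifulSoup(html, "html.parser")
    for header in soup.find_all("th"):
        if "Density" in header.get_text():
            cell = header.find_next_sibling("td")
            if cell is None:
                continue
            # Grab the leading figure, e.g. "1,094.5" from "1,094.5/sq mi (422.6/km2)".
            match = re.search(r"[\d,]+(?:\.\d+)?", cell.get_text())
            if match:
                return float(match.group().replace(",", ""))
    return None

# Made-up snippet shaped like an infobox; a real page would be downloaded first.
sample_html = """
<table class="infobox">
  <tr><th>Population</th><td>58,805</td></tr>
  <tr><th>• Density</th><td>1,094.5/sq mi (422.6/km2)</td></tr>
</table>
"""
density = extract_density(sample_html)
print(density)
```

In a full pipeline, this function would be called once per county page inside a loop, with the results collected into a DataFrame column.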

## Phase Two: Preparing for a Linear Regression

Now that I had all my data ready and cleaned, it was time to get into actually analyzing the data.

But of course, the task of cleaning the data wasn’t really done. In order to perform a linear regression, I had to make sure every variable was meaningfully numeric. This meant that I couldn’t keep a single column for census regions ranging from one to nine. While this column was technically numeric, it didn’t make any sense to treat region as an ordered category. If I did that, I would essentially be saying that region 1 was the most like region 2 and the most different from region 9, which was not a valid assumption. To get around this, I used a method known as one-hot encoding. I replaced the single column with nine columns, each containing a 1 for cities in that region and a 0 for cities *not* in it. This way, I could use these categories in my linear regression.
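Here is a small sketch of one-hot encoding with pandas. The `census_region` column name and the three example rows are invented for illustration; with the full dataset, the same call would produce one column per region.

```python
import pandas as pd

# Toy data: region is stored as a number, but the numbers are only labels.
cities = pd.DataFrame({
    "city": ["A", "B", "C"],
    "census_region": [1, 9, 1],
})

# Replace the single region column with one 0/1 indicator column per region.
# Only regions that appear in the data get a column (here, regions 1 and 9).
encoded = pd.get_dummies(cities, columns=["census_region"], prefix="region", dtype=int)
print(encoded)
```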

## Phase Three: Choosing our Model

Finally! It was time to find an answer to my question, right? Not so fast. First I had to decide what *kind* of model to use. I had two questions to answer. Did I want to use a Ridge regression or a LASSO regression? What regularization strength did I want to apply?

The difference between LASSO and Ridge regressions has to do with how the model treats features that don’t have much predictive power. When Ridge encounters a feature like this, it shrinks the coefficient that goes along with that feature, but it won't shrink it all the way to zero, so the feature is always taken into account in some small way. LASSO, by contrast, can shrink those coefficients all the way to zero, removing the features from the model entirely.
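To make the difference concrete, here is a toy sketch using scikit-learn on synthetic data, where the first feature drives the target and the second is pure noise. The data and the `alpha` values are arbitrary choices for the demonstration, not the ones used in this project.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)

# Two features: column 0 determines y, column 1 is unrelated noise.
X = rng.normal(size=(200, 2))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=200)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.5).fit(X, y)

# Ridge keeps a tiny non-zero weight on the noise feature;
# LASSO sets that weight to exactly zero.
print("Ridge coefficients:", ridge.coef_)
print("LASSO coefficients:", lasso.coef_)
```

Note that LASSO also shrinks the coefficient of the useful feature, which is the price paid for the simpler model.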

When it comes to regularization, that’s a whole topic for another time. If you want to learn more about what regularization means in the context of linear regressions, I recommend this article to get you started. To make a long story short, I needed to decide what importance I wanted to give to keeping the model simple, relative to reducing the error on the training set of data.

I decided to answer these questions through cross-validation.
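One way to set this up is a single grid search over both the model type and the regularization strength. The sketch below uses scikit-learn's `GridSearchCV` on synthetic data; the candidate `alphas`, the five folds, and the scaling step are assumptions for the example rather than the exact settings from this project.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real city-level data.
X, y = make_regression(
    n_samples=150, n_features=8, n_informative=4, noise=20.0, random_state=0
)

# Scale features first so the penalty treats every coefficient comparably.
pipeline = Pipeline([("scale", StandardScaler()), ("model", Ridge())])

alphas = [0.01, 0.1, 1.0, 10.0, 100.0]
param_grid = [
    {"model": [Ridge()], "model__alpha": alphas},
    {"model": [Lasso(max_iter=10_000)], "model__alpha": alphas},
]

# 5-fold cross-validation, scored by R^2 on each held-out fold.
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="r2")
search.fit(X, y)

best_model = type(search.best_params_["model"]).__name__
print(best_model, search.best_params_["model__alpha"], search.best_score_)
```

The combination with the highest average held-out score wins, which answers both questions (which model, and how much regularization) at once.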

## Initial Results

Surprisingly, both Ridge and LASSO regressions did very poorly on the validation data. Neither was able to explain more than 28% of the variation in price, and both were off by over $80,000 on average. This wasn’t great. I decided to perform some diagnostics on this initial modeling attempt, so I plotted the predicted prices against the actual prices for the validation set.
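For readers unfamiliar with these two numbers, here is a sketch of how they can be computed with scikit-learn. The prices are invented, and I'm assuming here that "off on average" means mean absolute error; root mean squared error is another common choice.

```python
from sklearn.metrics import mean_absolute_error, r2_score

# Made-up actual and predicted median prices for four cities.
actual = [200_000, 250_000, 300_000, 650_000]
predicted = [220_000, 240_000, 310_000, 400_000]

# R^2: the share of the variation in actual prices that the predictions explain.
r2 = r2_score(actual, predicted)

# MAE: the average size of the miss, in dollars.
mae = mean_absolute_error(actual, predicted)

print(f"R^2 = {r2:.3f}, MAE = ${mae:,.0f}")
```

In this toy example, a single badly missed expensive city accounts for most of the error, which is the kind of pattern a predicted-versus-actual plot makes easy to spot.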

A perfect model would place every data point on the diagonal, in a straight line. In the plot, three cities stood out: they were predicted to be much more expensive than the rest, and their actual prices were far from those predictions.

Upon investigation, I found that all three outliers were Californian cities. Based on this, I concluded that they must be varying according to a feature that I wasn’t taking into account, something unique to Californian cities.

In the end, I decided to remove them. This reduced the generalizability of the model to California, but if it increased the accuracy for non-Californian cities, that was a sacrifice I was willing to make.

In addition, I added a polynomial term equal to the squared value of the natural amenity score. This was meant to capture a pattern I saw in the residuals: cities with very high natural amenity scores tended to be under-valued by my model.
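Adding a squared term is a one-line change in pandas. The column names below are hypothetical:

```python
import pandas as pd

# Toy amenity scores for three cities.
cities = pd.DataFrame({"amenity_score": [-2.0, 0.5, 3.0]})

# The squared term lets a linear model bend upward at high amenity scores.
cities["amenity_score_sq"] = cities["amenity_score"] ** 2
print(cities)
```

One caveat of a plain square is that it treats very low scores the same way as very high ones, since both produce large positive values.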

With these optimizations in place, I moved on to the question of what features were most predictive.

## Natural Features vs Human Features

When I compared a regression based only on natural features to a regression based only on human features, I was surprised. I found that both models were equally *bad* at explaining housing prices. The natural-only model was only able to explain 49% of the variation in price, while the human-only model explained 47%.

An insight here is that each model is deriving about the same predictive power, but from totally different sets of features. That means that both these sets of features are uniquely useful.
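A comparison like this can be sketched by cross-validating the same model on different column subsets. Everything below is synthetic: the column names, the coefficients used to generate prices, and the choice of Ridge with `alpha=1.0` are all assumptions for the sake of a runnable example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-ins for the real features; all column names are hypothetical.
data = pd.DataFrame({
    "amenity_score": rng.normal(size=n),
    "water_area": rng.normal(size=n),
    "urbanization": rng.normal(size=n),
    "pop_density": rng.normal(size=n),
})

# Prices depend on one natural and one human feature, plus noise.
price = (
    250_000
    + 50_000 * data["amenity_score"]
    + 40_000 * data["urbanization"]
    + rng.normal(scale=20_000, size=n)
)

feature_sets = {
    "natural": ["amenity_score", "water_area"],
    "human": ["urbanization", "pop_density"],
}
feature_sets["combined"] = feature_sets["natural"] + feature_sets["human"]

# Average held-out R^2 for each feature subset.
scores = {
    name: cross_val_score(Ridge(alpha=1.0), data[cols], price, cv=5, scoring="r2").mean()
    for name, cols in feature_sets.items()
}
print(scores)
```

When the two groups carry independent information, as they do in this toy setup, each subset explains part of the variation and the combined set explains more than either alone.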

## Combined Model

When I combined the natural and human features, I was able to get a better handle on the variation in home prices. The combined model explained about 70% of the variation in the data it was trained on, but unfortunately its predictive power was much lower when applied to new data. There, it was only able to explain 52% of the variation, and on average it was off by over $70,000.
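A gap like this between training and held-out performance is a classic sign of overfitting, and it tends to show up when there are many features relative to the number of observations. The sketch below reproduces the effect on synthetic data; the sample size, feature count, and `alpha` are arbitrary and are not the project's values.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Few rows, many columns: easy to fit the training set, hard to generalize.
X, y = make_regression(
    n_samples=60, n_features=30, n_informative=5, noise=50.0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = Ridge(alpha=1.0).fit(X_train, y_train)

# score() returns R^2; compare seen data against unseen data.
train_r2 = model.score(X_train, y_train)
test_r2 = model.score(X_test, y_test)
print(f"train R^2 = {train_r2:.2f}, test R^2 = {test_r2:.2f}")
```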

## What Can We Learn?

Even though the model itself isn’t very useful for predicting home prices, we still have some useful findings. I think that existing models that seek to predict home prices would be wise to include the sorts of natural features that I included in this model. Since the natural-only model was able to explain about half of the variation in price, that tells me natural features are describing a real relationship that current models may not be capturing.

## Sources

If you want more information on the details of this project, I recommend checking out my GitHub repository.