
Anticipating Wildfires with Machine Learning Tools

Using data to uncover ideal wildfire suppression resource placement

Patrick Norman
Towards Data Science
Jun 9, 2021


Staring out the window of my middle school science classroom, it seemed like Hell itself had arrived on the outskirts of Boulder, Colorado. The smoke practically covered the sun, and the lesson plan was disrupted by the sounds of helicopters and low-flying planes buzzing back and forth to deliver water and flame retardant to the Fourmile Canyon fire. That fire would go on to burn 6,181 acres, destroy 168 homes, and cause an estimated $217 million in property damage.

Unfortunately, that fire is far from an anomaly these days. In 2020, wildfires in the western United States caused up to $13 billion in property damage, not to mention the lives lost and the heartache of countless families forced to rebuild.

Wildfires aren’t an unstoppable force. Human effort put a stop to the Fourmile Canyon fire, and to many other fires that have threatened our communities. One of the most important tools for fighting fires is knowing where the highest-risk areas are, so that firefighting resources can be pre-deployed and arrive as soon as possible.

After my time in the Data Science bootcamp offered by Metis, I was inspired to create a classification tool that could help stage firefighting resources more effectively. I wanted to apply my new skills to a problem that had been part of my life for years.

Scientific background

To understand the problem, we have to turn to the field of pyrogeography, the study of the geographic distribution of wildfire. While earning my degree in Environmental Science, I learned the basics of this field, including the three main considerations for understanding fire risk: fuel, climate, and terrain.

Fuel is a pretty obvious one. In areas with a lot of dead wood to burn, fires are more likely to start and may grow more quickly than elsewhere.

Climate is also clear. In areas with lower-than-usual humidity or higher-than-usual temperatures, fires are more likely and more intense.

Terrain is a more interesting variable. Fires tend to spread more quickly when traveling uphill, and they are harder to put out in hilly areas, so fires there qualify as higher risk than fires in flat, more easily accessible areas.

I wanted to establish proxies for all of these features so that I could take them into account when estimating fire risk.

Methods

I found a dataset of 1.88 million geo-referenced US wildfires on Kaggle, which would serve as my target variable. The data needed substantial reformatting: it only contained records of fires that did occur, with nothing for counties that had never had a wildfire. There were also limited data for the eastern states, forcing me to narrow my analysis to the western United States. I imported the data using SQLite.
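
The Kaggle dataset ships as a SQLite database, so the import boils down to a query. Here is a minimal sketch of how that might look; the file name matches the Kaggle download and the column names follow the dataset’s published schema, but the exact query is my own assumption, not the project’s code:

```python
import sqlite3

import pandas as pd

# SQLite file from the Kaggle "1.88 Million US Wildfires" dataset
conn = sqlite3.connect("FPA_FOD_20170508.sqlite")

# Pull only the fields needed for a county-level, western-US analysis;
# the column names follow the dataset's published schema
query = """
    SELECT FIRE_YEAR, STATE, FIPS_CODE, FIRE_SIZE, LATITUDE, LONGITUDE
    FROM Fires
    WHERE STATE IN ('CA','OR','WA','ID','NV','AZ','UT','MT','WY','CO','NM')
"""
fires = pd.read_sql_query(query, conn)
conn.close()
```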

As for my predictor features, I retrieved data from multiple sources: climate data from NOAA, terrain data from the USDA, and fuel data from the US Forest Service. After stitching these together, I had a pretty good picture of the variables I was working with across the western US.
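
As a rough sketch of that stitching step: assuming each source has been aggregated into a per-county table keyed on a FIPS code (the DataFrame and column names here are hypothetical), the joins might look like this:

```python
import pandas as pd

# Hypothetical per-county tables, each keyed on a county FIPS code:
# climate_df (NOAA), terrain_df (USDA), fuel_df (US Forest Service)
features = (
    climate_df
    .merge(terrain_df, on="fips", how="inner")
    .merge(fuel_df, on="fips", how="inner")
)

# Attach the per-county fire summary built from the Kaggle records;
# counties with no recorded fires become explicit zero rows
dataset = features.merge(county_fires, on="fips", how="left")
dataset["acres_burned"] = dataset["acres_burned"].fillna(0)
```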

I used a variety of models in an iterative process, but the most powerful I found was the Extra Trees Classifier from the scikit-learn package. The algorithm is a variant of the Random Forest that further reduces variance by selecting the cutoff point at each node of each decision tree at random, rather than searching for the optimal split. For more information on the Extra Trees algorithm, I recommend this article.
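
Fitting the model with scikit-learn looks roughly like this; the hyperparameters shown are illustrative defaults, not the tuned values from the project:

```python
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import train_test_split

# X: per-county predictors; y: binary label built from fire size
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Extra Trees draws split thresholds at random instead of searching
# for the optimal cut, trading a little bias for lower variance
model = ExtraTreesClassifier(n_estimators=500, n_jobs=-1, random_state=42)
model.fit(X_train, y_train)
```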

One of the most important aspects of the tool I wanted to build was flexibility: a user should be able to raise or lower the threshold for the fire size they want to pinpoint. Different fire suppression tools have different availabilities, and are therefore best suited to different severities. When deploying fire teams, it makes sense to check out all the small, medium, and large fires that are predicted to occur. When distributing larger, more expensive resources like airplanes and helicopters, however, a user should be able to narrow the results to a smaller, more severe range of fires. In practice, this comes down to binarizing the target at a user-chosen cutoff, as sketched below.
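
A minimal version of that thresholding step, reusing the hypothetical acres_burned column from the merged dataset above:

```python
def make_target(df, size_cutoff):
    """Label each county 1 if its burned acreage meets the cutoff, else 0."""
    return (df["acres_burned"] >= size_cutoff).astype(int)

# A low cutoff flags most counties; a high one isolates only severe fires
y = make_target(dataset, size_cutoff=1000)  # e.g. fires of 1,000+ acres
```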

Because of this, I needed to build in a dynamic guard against class imbalance. When a user is looking only for the largest fires, those fires may make up just a few percent of the observations the model is trained on. To maintain performance, I needed to make sure there were enough examples from the positive class for the model to work with. By checking for imbalance during each predictive run of the model, I was able to keep performance higher than it otherwise would have been when looking at very large fires.
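
One simple way to implement such a check is to oversample the positive class whenever it falls below a floor before refitting. The sketch below uses scikit-learn’s resample utility, and the 20% floor is my assumption, not the project’s actual setting:

```python
import pandas as pd
from sklearn.utils import resample

def rebalance(X, y, min_positive_frac=0.2, random_state=42):
    """Oversample positives when they fall below min_positive_frac."""
    pos, neg = X[y == 1], X[y == 0]
    if len(pos) / len(X) >= min_positive_frac:
        return X, y  # already balanced enough; leave data untouched
    # Number of positives needed to hit the floor exactly
    n_pos = int(min_positive_frac * len(neg) / (1 - min_positive_frac))
    pos_up = resample(pos, replace=True, n_samples=n_pos,
                      random_state=random_state)
    X_bal = pd.concat([neg, pos_up])
    y_bal = pd.Series([0] * len(neg) + [1] * len(pos_up), index=X_bal.index)
    return X_bal, y_bal
```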

Once I was confident that I had adequately tuned the model’s hyperparameters and tested it at various cutoff points, I moved on to deployment. I wanted the tool to be useful for a theoretical user, so I decided to host it online using Streamlit.

Streamlit is a great, simple tool for hosting Python code in a web browser, and I highly recommend it for anyone looking to deploy models without too much hassle. If you want more information on how to implement a Streamlit deployment, I recommend their excellent guide.
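
For flavor, a stripped-down version of such an app fits in a dozen lines; the file name and column names here are placeholders, not the project’s actual artifacts:

```python
import pandas as pd
import streamlit as st

st.title("Western US Wildfire Risk")

# User-facing control: how much of a county must be expected to burn
# before it gets flagged as high risk
cutoff = st.slider("Expected fraction of county burned", 0.0, 1.0, 0.10)

preds = pd.read_csv("county_predictions.csv")  # placeholder file
flagged = preds[preds["predicted_burn_frac"] >= cutoff]

st.write(f"{len(flagged)} counties meet or exceed this threshold")
st.map(flagged)  # expects 'latitude'/'longitude' columns
```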

Results

Here we can see an example of the deployed tool. The user selects a cutoff point, expressed as the percentage of a county expected to burn in the upcoming fire season, and then sees which counties, shown in red, are expected to meet or exceed that threshold.

Below the map, metrics display the precision and recall of the tool to the user. Precision stayed between 0.83 and 0.88, while recall varied much more, between 0.29 and 0.93, depending on the cutoff point the user selected. As the threshold fire size increased, precision improved and recall got worse.
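
Continuing the earlier sketch, both numbers can be recomputed for whatever cutoff the user picks with scikit-learn’s metrics module:

```python
from sklearn.metrics import precision_score, recall_score

y_pred = model.predict(X_test)
print(f"precision: {precision_score(y_test, y_pred):.2f}")
print(f"recall:    {recall_score(y_test, y_pred):.2f}")
```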

Conclusion

This tool exceeded my expectations for what a static dataset, with lots of holes and gaps, could accomplish when paired with basic measures of climate and terrain. I would love to see how this model would perform with more fine-grained spatial information and more data in general on fire extent and severity.

If you would like to see the guts of this project, head over to my GitHub, where you can see every curve in the road. Thanks for reading, and I hope you leave with a renewed appreciation of the power of nature, as well as our power to understand it with tools like these.
