GeoPandas Archives - Matthew Gove Blog https://blog.matthewgove.com/tag/geopandas/

How to Boost Your GIS Productivity with Python Automation in 5 Minutes https://blog.matthewgove.com/2021/11/05/how-to-boost-your-gis-productivity-with-python-automation-in-5-minutes/ Fri, 05 Nov 2021 16:00:00 +0000

Python Automation is one of the most powerful ways to improve your GIS workflow. In the past, many tasks in traditional GIS applications have had minimal support for writing your own code, and often required crude hacks to install obscure libraries.

As Python has rapidly grown in both functionality and popularity, it is now widely supported across, and even built into, many GIS platforms. Adding Python scripting to your GIS workflow can accomplish tedious hours-long tasks in seconds. Full automation of your GIS processes with Python will free you up to focus on the more important aspects of your project, regardless of what industry you’re in.

Automate Your Desktop GIS Application

Did you know that both Esri ArcGIS and QGIS ship with powerful built-in Python APIs (ArcPy and PyQGIS, respectively)? As a result, Python automation integrates effortlessly with both GIS platforms. The Python libraries for each platform are incredibly powerful, fast, and easy to use.

However, be aware that the Python libraries for ArcGIS and QGIS are specific to each platform. If you ever decide to change platforms, you’ll need to rewrite all of your Python scripts.

QGIS Window with Python Console

I recommend starting small to get your feet wet with GIS Python automation. Start by automating the symbology and color of your data before diving into data manipulations, calculations, and analysis. Then you can start tackling more complicated processes, such as file conversions, modifying layers, switching projections, and much more.

Automate Your Web-Based GIS Application

Automating web-based GIS applications with Python is not quite as seamless as with ArcGIS or QGIS. However, you can easily argue that it’s even more powerful. Web-based GIS applications are a bit more complicated than desktop-based platforms. In addition to the GIS software, you often need special servers and databases that are designed specifically for geospatial data.

Thankfully, this increased complexity also means that there are more opportunities for automation. I use Python automation on nearly all of my web-based GIS applications. I don’t have tutorials for all of these yet, but here are a few of my favorites.

Python Automation Updates Our COVID-19 Dashboard Every Day

Remote Sensing Automation with Python

Most sensors these days come with Python libraries when you buy them. You should absolutely take advantage of those libraries. With Python, you can calibrate and configure the sensors exactly how you want them, not the way the manufacturer wants them.

In May of 2019, I installed sensors on the weather station I have at my house. The weather station runs on a network of Raspberry Pis. A Python script reads the data from each sensor, runs QA/QC on it, and records it in the weather station’s database. If a sensor goes offline or makes a bad reading, the weather station pulls the data from the National Weather Service.
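That read-QC-record loop can be sketched in a few lines of Python. This is a minimal illustration using an in-memory SQLite database; the QA/QC limits, table schema, and fallback value are hypothetical stand-ins for the real station code and its National Weather Service fallback:

```python
import sqlite3

def qa_qc(value, low=-40.0, high=60.0):
    """Return the reading if it is physically plausible, otherwise None.

    The temperature limits here are illustrative; a real station would
    tune them per sensor.
    """
    return value if low <= value <= high else None

def record_reading(conn, sensor_id, value, fallback=21.5):
    """Record a QA/QC'd reading, substituting a fallback value
    (in the post, data pulled from the NWS) when the reading fails."""
    clean = qa_qc(value)
    if clean is None:
        clean = fallback
    conn.execute(
        "INSERT INTO readings (sensor_id, value) VALUES (?, ?)",
        (sensor_id, clean),
    )

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (sensor_id TEXT, value REAL)")
record_reading(conn, "temp_1", 18.2)   # plausible reading, recorded as-is
record_reading(conn, "temp_1", 999.0)  # fails QA/QC, fallback recorded
```

The key design point is that the QA/QC step sits between the sensor and the database, so bad readings never pollute the historical record.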

  • DIY Weather Station: Building a Solar Radiation Shield from Scratch
  • Wiring Power and Internet to the Sensors
  • Installing the Data Logger and Connecting the Sensors
  • Database Configuration and Programming the Sensor Readings
  • Troubleshooting a Sensor Gone Awry
https://youtube.com/watch?v=twZNWximYd0

Take your remote sensing automations even further. Use Python GeoPandas to plot your data on a map. Perform a high-level data analysis using pandas or matplotlib. You can easily automate the whole process or leave yourself as much manual control as you wish.

Data Entry Automation with Python

Without data, you don’t have a GIS application. You just have a map. Furthermore, geodatabases and data repositories come in all different shapes and sizes. Thankfully, Python can easily handle all of these data types and schemas thanks to its robust and dynamic data science libraries.

Python’s pandas library is one of the most powerful data analysis libraries available in any programming language. The fact that it’s free and open source is even more incredible, given how expensive licenses to proprietary software can be. pandas can handle just about any data format and size you throw at it.

However, pandas on its own does not support any geographical or location-based data. Enter Python’s GeoPandas extension of the pandas library. GeoPandas gives you the ability to analyze geospatial data and generate maps using the same tools you have in pandas. Easily populate a geodatabase or assemble a repository of any supported GIS format, including shapefiles, CSV, GeoJSON, KMZ, and much more. For more information, please visit our collection of GeoPandas tutorials.

Python GeoPandas can create beautiful maps without a GIS application
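As a minimal sketch of that workflow (the city names and coordinates below are just sample data), a plain pandas DataFrame with latitude/longitude columns can be promoted to a GeoDataFrame and serialized to GeoJSON:

```python
import pandas as pd
import geopandas as gpd

# Ordinary tabular data with plain lat/lon columns
df = pd.DataFrame({
    "city": ["Phoenix", "Tucson"],
    "lon": [-112.07, -110.97],
    "lat": [33.45, 32.22],
})

# Promote it to a GeoDataFrame with point geometry in WGS 84
gdf = gpd.GeoDataFrame(
    df, geometry=gpd.points_from_xy(df["lon"], df["lat"]), crs="EPSG:4326"
)

# Serialize to GeoJSON; gdf.to_file() can likewise write shapefiles,
# GeoJSON files, and other supported formats
geojson_str = gdf.to_json()
```

From here, the same GeoDataFrame can be plotted, reprojected, or written out in whichever format your geodatabase expects.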

Data Analysis Automation with Python

With over 12 years of experience in professional data analysis, I know firsthand how tedious repetitive tasks can be. Instead of the monotony of repeating those tasks over and over, why not automate them with Python? After all, Python developers created both pandas and matplotlib for that exact purpose. In the context of GIS, you can fully or partially automate many common tasks.

  • Repetitive tasks to prepare and/or format the data for analysis
  • Create maps of different areas using the same data parameters
  • Generate multiple maps of the same areas using different data parameters
Python has plenty of powerful data analysis libraries available for geospatial data

How to Trigger Your GIS Automation

To reach the nirvana of full automation, a Python script alone is not enough. You also need to automate the automation. Fear not, though: triggering your automation is the easy part. You have two options to choose from.

Trigger Your Automation to Run at a Set Time

The majority of GIS automations run at the same time every day. Our COVID-19 dashboard is the perfect example of this. We have a Python script that downloads the COVID-19 data from the Johns Hopkins University GitHub repository, parses it, and adds it to our database. Unfortunately, our web hosting plan does not allow us to fully automate the script, so we automate it on a local machine and then upload the database to the production server.

Scheduling the automation on your computer or server is quick and easy. On Linux and macOS, use the crontab command to schedule a cron job, which is a job that runs at fixed dates, times, and/or intervals. Alternatively, use the Task Scheduler on Windows. Both schedulers give you full flexibility to schedule jobs exactly when you want them to run.
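For example, a crontab entry like this (the script path is a hypothetical placeholder) runs an update script every day at 2:00 a.m.:

```shell
# Open the current user's crontab for editing
crontab -e

# minute hour day-of-month month day-of-week  command
0 2 * * * /usr/bin/python3 /path/to/update_gis_data.py
```

The five leading fields control the schedule, so the same syntax handles hourly, weekly, or oddball intervals just as easily.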

Trigger the Script to Run When a Specific Event Occurs

Alternatively, not all jobs run at a specific time or interval. Have a look at the map of the Matt Gove Photo albums and videos. There is no logical need to run the job at a set time or interval. Instead, we update the map whenever we add a photo album or video to the database. As soon as the new album or video lands in the database, it automatically adds the data to the map.

In Python, the simplest way to trigger your GIS automation is a call to a function that runs the automation. For example, let’s look at the logic of adding photos and videos to the Matt Gove Photo map. In its simplest form, the logic for adding a photo album would look something like this.

# Add a Photo Album to the Database
add_photo_album_to_database(album_parameters)

# Once the database is updated, update the map
update_map(album_parameters)

This example is very oversimplified, but you get the point. For even finer control, use conditional logic and loops to trigger your scripts exactly when you want.
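A slightly fuller sketch of that event-driven pattern might look like the following. All of the names here are hypothetical stand-ins for the real database and map code; the point is that the map update fires automatically the moment an album lands in the database:

```python
database = []
map_markers = []

def update_map(album):
    """Add a map marker for the new album, but only if it is geotagged."""
    if album.get("lat") is not None and album.get("lon") is not None:
        map_markers.append((album["title"], album["lat"], album["lon"]))

def add_photo_album_to_database(album):
    """Store the album, then fire the map update as the 'event'."""
    database.append(album)
    update_map(album)

add_photo_album_to_database({"title": "Death Valley", "lat": 36.25, "lon": -116.83})
add_photo_album_to_database({"title": "Untagged Album"})  # no coordinates, no marker
```

Because the trigger lives inside the database function, there is no scheduler to maintain and no window where the map lags behind the data.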

Don’t Forget to Test Your Automation Scripts Before Putting Them into Production

We all make this mistake at one point or another. You beam with pride when you finish your automation script, and schedule it to run for the first time overnight. The next morning, you log in eagerly expecting to see the output of your automation. Instead, you see nothing, or even worse, an error message. You facepalm because you forgot to test everything!

The best way to test your automation is to write a few quick unit tests once you finish your script. If you’re unfamiliar with unit tests, they test individual units of code in your script. You tell the test the expected outcome for a given parameter, and then run that parameter through the block of code. If the script output matches the expected output, the test passes. If not, it fails.

For example, let’s say you programmed a calculator application. To set up a unit test for addition, execute 2 + 2 with the calculator, and see if you get 4. Repeat the process with unit tests for subtraction, multiplication, and division. The best part about unit tests is that you can run a lot of them in a short amount of time. If you’ve written them correctly, they’ll tell you exactly where in the script any problems are.
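Using Python’s built-in unittest module, the calculator example above might look like this (the calculator functions themselves are, of course, illustrative):

```python
import unittest

# The calculator functions under test
def add(a, b):
    return a + b

def subtract(a, b):
    return a - b

class TestCalculator(unittest.TestCase):
    def test_addition(self):
        # Execute 2 + 2 and check that we get 4
        self.assertEqual(add(2, 2), 4)

    def test_subtraction(self):
        self.assertEqual(subtract(5, 3), 2)

if __name__ == "__main__":
    unittest.main(exit=False, verbosity=2)
```

When a test fails, unittest reports exactly which test case broke, which is what makes a suite of small tests so good at pinpointing problems.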

Use Creativity and Innovation in Your Python Automation

Once you get your feet wet with GIS automation using Python, keep automating. I encourage you to get creative and come up with new, innovative ways that will improve your workflow even further. The sky really is the limit when it comes to automation.

Conclusion

The days of managing bloated and complicated workflows with expensive software are a thing of the past. Python automation is the future, not just in GIS, but in nearly every industry out there. Start out with simple tasks to whet your appetite. Once you get a taste of it, don’t be afraid to scratch that creative or innovative itch we all have. You’ll be amazed at the amount of time and money it can save. Let us help you get started today.

Top Photo: View of Death Valley from Sea Level
Death Valley National Park, California – February, 2020

13 Stunning Examples Showing How Easy It Is to Spread Disinformation without Manipulating Any Data https://blog.matthewgove.com/2021/07/30/13-stunning-examples-showing-how-easy-it-is-to-spread-disinformation-without-manipulating-any-data/ Fri, 30 Jul 2021 16:00:00 +0000

The spread of disinformation and fake news seems like it’s about as American as apple pie these days. As a data scientist, it’s beyond horrifying watching so much disinformation rip through every facet of our society like wildfire. Sure, you grow to expect it from the idiots on the internet. But the fact that it now dominates everything from the news media to our education system to our jobs? That’s much more concerning.

Before we get too far, I want to say that the content of this post is designed for educational purposes only. I do not endorse the spread of disinformation or any conspiracy theories in any way. You should always back up your arguments with strong logic and easily verifiable facts.

Recent statistics about disinformation over the past year or two are eye opening.

  • 67% of Americans have interacted with disinformation or fake news on social media.
  • 56% of Facebook users cannot identify fake news that aligns with their own beliefs.
  • Less than 30% of American adults trust the news media.
  • In the third quarter of 2020 alone, Facebook saw over 1.8 billion engagements with fake news.

And that’s not even the tip of the iceberg.

How Do We Create and Spread Disinformation?

Sadly, it’s far too easy to create, publish, and spread disinformation these days. There is an endless list of different methods to create disinformation, but here are a few of the more popular ones.

  • Manipulating Data or Statistics
  • Using Logical Fallacies
  • Making an argument that uses flawless logic, but the statements that are input into the argument are false
    • Example: Rocks are vegetables. I like to eat vegetables. Therefore, I like to eat rocks.
  • Injecting technical jargon and fancy words into a statement that is otherwise complete BS
  • Just making something up off the top of your head.

One of My First Memorable Encounters with Real World Disinformation

One of my first encounters with disinformation in the “real world” came after graduating into the teeth of the Great Recession in 2009. Like so many people at the time, I struggled mightily to find work. As the election season began heating up, it was quite clear that Republicans were going to do very well in the 2010 midterms. At the time, Democrats controlled the House, the Senate, and the White House. The economic recovery was moving painfully slowly, and unemployment remained stubbornly high.

Then, all of a sudden, shortly before the 2010 midterms, the unemployment rate mysteriously dropped, and it dropped a lot. What happened? Was the recovery finally kicking into high gear? Not really. Turns out, the number of unemployed people hadn’t really changed at all.

Instead, the Obama administration had decided that they didn’t like the optics of high unemployment levels, so they changed how the unemployment rate was calculated so it looked lower than it actually was. Long term unemployment was a particular problem coming out of the Great Recession, so they simply stopped including the long-term unemployed when they calculated the unemployment rate. Thankfully, the media called them out on it. As a result, the different methods of calculating the unemployment rate became much more transparent.

The Most Insidious Way to Spread Disinformation: A Look at the 2020 Election and the COVID-19 Pandemic

Today, we’re going to look at one of the most subtle, insidious, and incredibly effective ways to spread disinformation. You don’t need to manipulate any data or statistics. Nor do you need to tie yourself in knots using pretzel logic to make your argument.

Indeed, all you need to use is a little equivocation. When you equivocate, you tell part of the truth, but not the whole truth. The part of the truth you don’t want revealed is usually obfuscated in vague language. When done effectively, you’re not telling the whole truth, but you’re not telling a bald-faced lie, either.

Disinformation Spread in the 2020 Election: It All Starts with a Simple Map

Take yourself back to election night. You’ve cast your vote, and it’s time to sit down and watch the election returns. Regardless of which TV network or website you’re watching, they’re filling in this map.

2020 Election Results by County Can be Misleading

On the surface, this map looks completely harmless. More importantly for the TV networks, their audience understands this map without needing any explanation.

In reality, this map is one of the most misleading ways to present election returns that exists. It infuriates me to no end that people still use it. One of the most common arguments I hear from people who look at this map is that there is so much more red than blue on the map, there is no possible way Trump lost the election.

There’s A Lot This Map Does Not Show

It’s true, there is far more red than blue on the map. And that’s exactly why the map is so misleading. To poke holes in that argument, let’s look at what the map shows and what it doesn’t show.

What the Map Shows

  • The winner of each county

What the Map Doesn’t Show

  • How many votes were cast
  • The population of each county
  • The margin of victory
  • The percentage of the vote each candidate received

To further show how useless that map is, let’s compare it to the results of the 2016 election. Recall that in the 2020 election, Biden won 306-232 in the Electoral College. In 2016, Trump won by that exact same margin. Now compare the two maps using the slider. Can you easily tell which candidate won?

2016
2020

Not only can you not easily tell which candidate won, the 2016 and 2020 maps are practically identical. The only county with any significant population that changed colors between the two elections was Maricopa County in Arizona. This map has played a significant role in Maricopa County being the target of so many election-related conspiracy theories.

Introduce Population and Vote Tallies into the Map to Improve It

In order to better present the election results, you’ll need to incorporate at least one of either population or number of votes cast. Ideally you can incorporate both. First, let’s look at a map of population by county.

US Population Map by County

If you overlay the population map on either map of election results above, you should notice a very distinct correlation. The Democrat candidate won the more populous counties almost exclusively. When you see such a strong correlation, it means you have found the statistic that is skewing the data on your maps and fueling the spread of disinformation.

So exactly how do we show population on our map? The easiest way is to put a colored dot inside each county instead of shading the entire county. Then scale the diameter of the dot based not on population, but instead on the number of votes cast for the winning candidate. Like our choropleth map, the dots are shaded blue or red to indicate which candidate won.

It’s not perfect, but it gives a much more accurate picture of the 2020 election results.

For comparison, here’s the same map for the 2016 election.

But Wait, Trump Won the 2016 Election 306-232. This Map Doesn’t Reflect That!

Good catch! You’re partially correct. Trump did win the 2016 election 306-232. And the 2016 map does show a lot more blue on it. So what gives? Trump won the Electoral College vote in 2016, but Hillary Clinton won the popular vote. The election maps with the scaled dots on them reflect the popular vote, not the Electoral College vote.

Vote                  Donald Trump    Hillary Clinton
Electoral College     306             232
States Won            30              20, plus DC
Total Votes Cast      62.9 million    65.8 million
Percentage of Vote    46.1%           48.2%

2016 Election Voting Statistics

A Look at 2004: The Most Recent Election the Republican Candidate Won the Popular Vote

The 2004 presidential election marks the only time in recent history that the Republican candidate won the popular vote. In 2004, President George W. Bush won both the Electoral College (286-251) and 50.7% of the popular vote (62 million to 59 million). Our map does correctly indicate that Bush won the popular vote that year.

2004 Election Results Normalized by Votes Cast by the Winning Candidate fights disinformation

So Can We Create an Electoral College Map That Does Not Spread Disinformation?

Because the Electoral College is a state-level process, it’s impossible to do so at the county level. However, we can recreate the map using scaled dots to represent the Electoral College. Like the county-level choropleth maps, population skews the Electoral College choropleth maps, leaving them ripe for the spread of disinformation as well.

2020
2016

Can Any Maps Debunk the Spread of Election Disinformation and Conspiracy Theories?

Maps can certainly explain what happened in Trump’s rise to power in 2016 and Biden’s triumph in 2020. Unfortunately, people who believe in conspiracy theories are often so irrational that no map is likely to convince them.

To show what led to Trump’s rise as well as his demise, let’s brainstorm a few changes we may want to look at when comparing the 2020 election to 2016.

  • Demographics
  • Voter behavior
  • Candidate popularity
  • Voter turnout

To save you the hassle, we’re going to look at the total voter turnout between the two elections, as well as who those voters were voting for. We’ll do this for each county. The math is simple: just addition and subtraction.

total_vote_difference = total_votes_2020 - total_votes_2016
dem_vote_difference = dem_votes_2020 - dem_votes_2016
rep_vote_difference = rep_votes_2020 - rep_votes_2016

To determine which candidate gained the most ground, simply compare the Democrat vote differences to the Republican vote differences.

vote_difference = dem_vote_difference - rep_vote_difference

If vote_difference is a positive number, it means the Democrats gained votes. If it’s negative, the Republicans gained votes. The larger the magnitude of vote_difference, the bigger those gains were.
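With pandas, those three subtractions run over every county at once. A minimal sketch, using made-up county names and vote totals purely for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "county": ["County A", "County B"],
    "dem_votes_2016": [1000, 4000],
    "rep_votes_2016": [3000, 2000],
    "dem_votes_2020": [1500, 5000],
    "rep_votes_2020": [3200, 1800],
})

# The same arithmetic as above, vectorized across every county
df["dem_vote_difference"] = df["dem_votes_2020"] - df["dem_votes_2016"]
df["rep_vote_difference"] = df["rep_votes_2020"] - df["rep_votes_2016"]
df["vote_difference"] = df["dem_vote_difference"] - df["rep_vote_difference"]
# Positive vote_difference: Democrats gained ground; negative: Republicans did
```

In this toy data, County A nets +300 for the Democrats (a 500-vote Democrat gain against a 200-vote Republican gain), and County B nets +1,200.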

Let’s Look at Maps of Vote Gains

Let’s look at those maps. In addition to comparing 2020 to 2016, I’ve included a map that compares 2016 to 2012.

Let’s also look at total voter turnout in each county.

There are a few conclusions I can draw from these maps to combat disinformation and conspiracy theories.

Metric20202016
Voter TurnoutTrump was so polarizing, he turned out massive numbers of voters on both sides.Many “on-the-fence” voters, especially those that lean Democrat, stayed home for various reasons.
Candidate PopularityAs ferociously devoted as Trump’s base was, Democrat voters hated him even more.Both candidates were wildly unpopular. Many voters felt Trump was the lesser of two evils.
IndependentsThe independents that went for Trump in 2016 turned on him in 2020. Many moderate Republicans voted for Biden, too.Many independents, especially across the Rust Belt, voted for Trump. The numbers out of Detroit are particularly fascinating.
Suburban VotersSuburban voters revolted against Trump. There are huge Democratic gains in nearly every major cityDem-leaning suburban voters stayed home or went for Trump, particularly in Detroit and Milwaukee.
Where The Election FlippedLarge Democratic turnout in 6 metropolitan areas won the Election for Biden: Philadelphia, Pittsburgh, Detroit, Atlanta, Phoenix, and MilwaukeeRust belt voters that felt abandoned by Obama came out in droves for Trump, and flipped Pennsylvania, Ohio, Michigan, and Wisconsin, a total of 64 Electoral Votes
FloridaTrump picked up significant votes in Miami-Dade County (likely Cuban Americans voting against socialism), giving him a comfortable win in the state.The Interstate 4 Corridor (Tampa to Daytona) that delivered the state to Obama in 2012 swung significantly back to the right and went for Trump.

All right, enough about the election. Let’s move on and look at some COVID-19 data.

The COVID-19 Pandemic: A Stunning Exercise in the Spread of Disinformation

If there’s anything that’s torn through the United States faster than COVID-19 itself, it’s the disinformation associated with it. No matter what facet of the pandemic we’re talking about, we cannot agree with our fellow Americans on anything.

Want to know what’s even more frightening? It’s even easier to spread disinformation about COVID-19 than it is about the election. And we don’t have to worry about the election putting us in the hospital or killing us.

The Default COVID-19 Maps are Plagued by the Same Population Issue as the Election Maps

By default, most media outlets show new daily COVID-19 cases by either state or county. While that’s perfectly fine if that’s what you’re looking for, it is a terrible map if you’re trying to identify hot spots. Here’s a recent map of new daily COVID-19 cases in the United States. Take a guess as to where the hottest spot for COVID-19 is.

Map of new COVID-19 cases by US county has been used to spread disinformation.
New Daily COVID-19 Cases in the United States – 18 July, 2021

Looking at this map, you’ll likely identify two hotspots: Florida and the Southwest. Yes, COVID-19 is raging in Florida, Los Angeles, and Las Vegas, but none of those spots is where the worst outbreak is. And where is that outbreak right now? It’s in Missouri and Arkansas, but you wouldn’t know it looking at this map.

Color Schemes: The Most Insidious Way to Spread Disinformation

The color bar on any map seems innocent enough. Its primary purpose is to make your map look really good. How bad can it be?

Turns out, the color scheme is particularly deceptive. You don’t need to do anything to the actual data. Nor do you need to twist yourself up in pretzel logic just to make your point. Even worse, people choose bad color schemes accidentally all the time, spreading disinformation without even realizing it.

While there are all kinds of ways to manipulate the color bar, here are the three most common.

Change the Upper and/or Lower Limits of Your Color Bar

Look at the map of new daily cases above. The data range goes from 0 to 1,462 new daily cases. Now what would happen if I increased the upper limit by an order of magnitude, from 1,500 to 15,000? All of the counties would be shaded either white or very light green, and it would look like there’s no COVID-19 at all.

Disinformation COVID-19 Daily New Cases Map with Manipulated Color Bar makes it appear there's no COVID-19

Conversely, what if I reduced the upper limit from 1,500 down to 5? It would look like the world is about to end, with COVID-19 spreading everywhere. That’s clearly not an accurate representation of what’s going on, either.

Don’t forget, both maps show the exact same dataset. All we did was change the color bar.

Disinformation COVID-19 Daily New Cases Map with Manipulated Color Bar to Show It Worse than it actually is
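You can see the effect numerically with matplotlib’s Normalize, which maps data values onto the 0-to-1 range the color bar spans. This sketch uses the post’s numbers; raising the upper limit by an order of magnitude pushes every county toward the bottom (light) end of the color bar:

```python
from matplotlib.colors import Normalize

honest = Normalize(vmin=0, vmax=1500)       # matches the actual data range
deceptive = Normalize(vmin=0, vmax=15000)   # upper limit inflated 10x

# A county with 750 new daily cases sits mid-scale on the honest map,
# earning a mid-range color...
print(float(honest(750)))
# ...but lands near the bottom of the deceptive scale, so it is shaded
# nearly white, as if there were almost no COVID-19 there
print(float(deceptive(750)))
```

Nothing about the data changed between the two calls; only the mapping from value to color did.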

Change the Break Points of Your Color Bar

By default, most mapping and GIS programs either break the color bar into even increments or distribute the data points evenly across its sections. While neither approach is perfect, both work well in many cases.

Now let’s take this to the extreme. For this example, you’re a corrupt leader who wants to publish a map showing no COVID-19, despite the fact that it’s raging in your area. Using the same 0 to 1,500 scale, you set the first section of the color bar to cover 0 to 1,300. The remaining colors are set in increments of 50: 1,301 to 1,350; 1,351 to 1,400, and so forth.

That map makes it look like there is basically no COVID-19 spreading in the United States.

Manipulating the color bar breaks on a map is an easy way to spread disinformation.
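matplotlib’s BoundaryNorm makes the corrupt leader’s trick concrete. Using the break points described above (one giant bottom bin, then 50-case slivers), a county with 10 cases and a county with 1,290 cases land in exactly the same color bin:

```python
from matplotlib.colors import BoundaryNorm

# One giant bottom bin (0-1,300), then 50-case slivers up to 1,500
breaks = [0, 1300, 1350, 1400, 1450, 1500]
norm = BoundaryNorm(breaks, ncolors=len(breaks) - 1)

# Wildly different case counts, identical color
print(int(norm(10)), int(norm(1290)))
# Only counties above 1,300 new daily cases ever change color
print(int(norm(1320)))
```

Since almost every county falls below 1,300 new daily cases, nearly the entire map gets the bottom bin’s color, and the outbreak disappears.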

Alter the Number of Breaks in Your Color Bar

While there are certainly isolated circumstances when you want to increase the number of breaks, this method is far more effective when you reduce the number of breaks in your color bar. In our original map, there are 7 breaks for a total of 8 colors.

Now, let’s reduce the color bar from 8 colors to 2. The light yellow color will cover 0 to 750 new cases per day. Likewise, the dark blue color will cover 751 to 1,500 new daily cases.

As for the result? Once again, it looks like there is no COVID-19 in the United States. On other days, though, some areas that are raging look like there’s nothing there. At the same time, other areas that do not have a problem look like COVID-19 is exploding out of control. Talk about disinformation!

COVID-19 New Daily Case Map: reducing the number of colors in the color bar spreads disinformation.

I Shouldn’t Give You Any More Ideas to Spread Disinformation, But…

I know what you’re thinking. There’s no way people can so blatantly manipulate the color bar and get away with it. Your intuitions are correct, but the examples we just looked at are deliberately extreme.

You can easily combine these methods to much more subtly mislead your audience. There are also plenty of other ways to mess with the color scheme that I haven’t touched on here. One easy way is to invert the colors. You can also use an illogical progression of colors throughout the color bar.

This is why when you look at any kind of figure, you should always verify both the color scheme and its limits before you make any assumptions about it. All it takes is a quick glance at the legend.

Use Logarithmic Scaling to Reduce Color Bar Manipulation

So is there anything we can do to reduce such easy color bar manipulation? If you’re dealing with a large range of data, use logarithmic scaling. For those of you who are unfamiliar with the logarithmic scale, it’s simple.

Instead of incrementing your axis in multiples of a number, you’re incrementing it by powers of that number. For example, a linear scale using multiples of 10 would be 10, 20, 30, 40, 50, 60, and so on. A logarithmic scale using powers of 10 would be 1, 10, 100, 1,000, 10,000, 100,000, 1,000,000, and so on.

Why a logarithmic scale? First off, it has preset intervals, so it’s very difficult to subtly alter the breaking points in your color bar. The logarithmic scale’s preset intervals also limit or prevent the data from shifting if you change the limits of the color bar. For example, on the COVID-19 map, 400 new daily cases will fall in the 100 to 1,000 section, no matter how high I set the upper limit of the color bar.
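The claim about 400 cases can be checked with a couple of lines of standard-library Python: on a power-of-ten scale, the bin a value falls in depends only on the value itself, never on the color bar’s upper limit. (The helper below is an illustrative sketch, not dashboard code.)

```python
import math

def log_bin(value, base=10):
    """Return the (lower, upper) power-of-ten bin a positive value falls in."""
    exponent = math.floor(math.log10(value))
    return (base ** exponent, base ** (exponent + 1))

# 400 new daily cases always falls in the 100-1,000 bin,
# no matter how high the top of the color bar goes
print(log_bin(400))
```

Contrast that with the linear examples above, where stretching the upper limit silently shifted every value’s color.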

What Color Scale Do I Use?

On my COVID-19 Dashboard Map, I use a hybrid logarithmic scale. It’s simply a logarithmic scale with breaks halfway through each section of the scale. So instead of break points being at 1, 10, 100, 1,000, and so forth, they are at 1, 5, 10, 50, 100, 500, 1,000, 5,000, and so on.

The reason I chose a hybrid logarithmic scale is because the data range was not big enough to use a straight logarithmic scale. As a result, the map would have been too misleading, and would not have accurately shown areas where COVID-19 is surging.
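Generating those hybrid break points takes only a small helper. This is a sketch of the idea, not the dashboard’s actual code:

```python
def hybrid_log_breaks(max_power):
    """Power-of-ten breaks with an extra break at 5x each power:
    1, 5, 10, 50, 100, 500, ..."""
    breaks = []
    for p in range(max_power + 1):
        breaks.append(10 ** p)
        breaks.append(5 * 10 ** p)
    return breaks

print(hybrid_log_breaks(3))  # [1, 5, 10, 50, 100, 500, 1000, 5000]
```

The extra 5x breaks double the color resolution of a plain logarithmic scale, which is exactly what a modest data range needs.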

Look at Other Parameters to Counter Disinformation

Listen to your gut. If it’s telling you a map or figure is misleading, it likely is. Regardless of whether you’re looking at a published map or creating a map to publish, look at other parameters in the same dataset. The more parameters that back up your reasoning, the stronger your argument will be.

Normalize the Data by Population

In our COVID-19 dataset, the easiest way to get around the population issue is to normalize the data by population. Instead of the raw number of new daily cases, plot the number of new daily cases per million people.

New Daily COVID-19 Cases per 1 Million People – 18 July, 2021

That’s a big step in the right direction. You can at least see the big outbreak of cases in Missouri and Arkansas. However, Florida is also getting hit very hard right now, and this map makes Florida look a lot better than it actually is.
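The normalization itself is one line of pandas. The county names and numbers below are invented, but they show how per-capita rates can invert the story the raw counts tell:

```python
import pandas as pd

df = pd.DataFrame({
    "county": ["Rural County", "Urban County"],
    "new_cases": [50, 500],
    "population": [100_000, 2_000_000],
})

# Raw counts make Urban County look 10x worse, but per-million,
# Rural County is actually the hotter spot
df["cases_per_million"] = df["new_cases"] * 1_000_000 / df["population"]
print(df[["county", "cases_per_million"]])
```

Here Rural County runs 500 cases per million against Urban County’s 250, the opposite of what the raw counts suggest.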

14-Day Change in New Daily Cases

Next up, let’s look at the two-week change in new daily cases. It’s a great map for identifying which way cases are trending, but it can be very misleading if you don’t know how to interpret it.

For example, if a county has just peaked and is starting to decline, the county will show bright green. Woo-hoo, right? Not so fast. You’re just past the peak. COVID-19 is still raging.

Here’s what the recent map looks like.

14-Day Change in New COVID-19 Cases – 18 July, 2021

You should never rely on this map alone to make any decisions related to COVID-19. When you start analyzing the map, keep in mind that this map only shows the trends. It does not show how much COVID-19 is in the counties. Look at Massachusetts. It looks like it’s in worse shape than Missouri and Arkansas.

The map doesn’t show that Massachusetts has incredibly low case loads because it’s the most vaccinated state in the country. On the other hand, Missouri and Arkansas have some of the lowest vaccination rates in the country, which is why the Delta variant is ripping through their communities at such an astonishing rate.

Active Cases Per Million People

The number of active cases per million people looks very similar to the new daily case loads per million people. As a result, you can see the big surge in Missouri and Arkansas, but the surges in both Florida and Las Vegas are lost in the noise.

Active COVID-19 Cases per 1 Million People – 18 July, 2021

Odds Any One Person You Interact With in Public is Infected

When I drove across the country at the peak of the COVID-19 pandemic last winter, I wanted to minimize my risk of contracting the virus. Calculating the odds that any one random person you cross paths with is infected is a great way to do that. All you need to do is divide the number of active cases by the population.
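A minimal sketch of that calculation (the numbers are hypothetical):

```python
def infection_odds(active_cases, population):
    """Return the probability that one random person is infected,
    plus the equivalent '1 in N' odds."""
    probability = active_cases / population
    return probability, round(1 / probability)

# Hypothetical county: 15,000 active cases among 1,000,000 residents
p, one_in_n = infection_odds(15_000, 1_000_000)
# Roughly a 1.5% chance, or about 1 in 67
```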

Again, it’s plagued by the same issue. You can see the big COVID-19 outbreak in Missouri and Arkansas. However, it doesn’t pop off the page and instantly draw your eye to it. Nor can you really see the ongoing surges in Florida or Las Vegas.

Odds Any 1 Random Person is Infected with COVID-19 – 18 July, 2021

None of These Plots Show Hot Spots Well. What Now?

I know what you’re thinking. You just spent this entire post explaining how easy it is to spread disinformation through color bar manipulation. You can’t be about to suggest manipulating the color bars now just to show where the COVID-19 outbreaks are.

Rest assured, we will not be doing anything to the color bars. Doing otherwise is flat out hypocritical. Instead, we can use Matt’s Risk Index. The index is essentially a weighted average of all of the parameters we just looked at. It’s designed to make hot spots and high-risk areas really jump off the page. If you’re interested in the math behind Matt’s Risk Index, we discussed it in detail when I first unveiled the index last winter.
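The exact parameters and weights behind Matt’s Risk Index are covered in the earlier post; as a generic sketch, a weighted average of risk parameters looks like this (the values and weights below are placeholders, not the real index):

```python
def weighted_average(values, weights):
    """Weighted average of pre-normalized risk parameters,
    e.g. cases per million, 14-day trend, active cases per million."""
    assert len(values) == len(weights)
    total_weight = sum(weights)
    return sum(v * w for v, w in zip(values, weights)) / total_weight

# Placeholder parameter values and weights, for illustration only
risk = weighted_average([0.5, 1.0], [1, 3])
```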

Before looking at Matt’s Risk Index, recall where the hot spots in the United States are right now.

  • Missouri and Arkansas
  • Florida Peninsula
  • Clark County, Nevada (Las Vegas)
  • Los Angeles County, California

LA County’s huge population likely keeps its risk level quite low for now, but the other three areas should leap off the page when you look at Matt’s Risk Index.

Matt’s COVID-19 Risk Index – 18 July, 2021

The Matt’s Risk Index map also seems to confirm health officials’ concerns that the southeast US is at very high risk for a Delta variant surge. Louisiana, Mississippi, Alabama, and Tennessee are some of the least vaccinated states in the country, and there are significant outbreaks of the Delta variant on either side of them right now.

My Favorite Example: Georgia’s Stunningly Boneheaded Decision to Spread COVID-19 Disinformation

What goes through the minds of some people when they make graphics is beyond me. In May 2020, the Georgia Department of Public Health tried to convince its citizens that it was okay to reopen everything and resume normal day-to-day lives. COVID-19 was a thing of the past.

To support their argument, the State of Georgia published a chart that at first glance showed steadily declining COVID-19 cases. Unfortunately, when you took a closer look, one small problem appeared. The dates were in the wrong order.

Where does Sunday take place twice a week? And May 2 come before April 26?

The State of Georgia, as it provides up-to-date data on the COVID-19 pandemic.

In the latest bungling of tracking data for the novel coronavirus, a recently posted bar chart on the Georgia Department of Public Health’s website appeared to show good news: new confirmed cases in the counties with the most infections had dropped every single day for the past two weeks.

In fact, there was no clear downward trend.

Atlanta Journal Constitution

You can read the full story from the Atlanta Journal Constitution.

Thankfully, Governor Brian Kemp’s office quickly fixed the error as soon as they got called out for spreading disinformation. But there is no reasonable excuse at all to be publishing that garbage in the first place, let alone in the middle of a major public health emergency.

Not surprisingly, the late night comedians had a field day with it.

Data and Source Code That Generates the Maps in This Post

I believe in transparency, especially when it comes to the spread of disinformation. You can find the Python code and the data used to generate every map in this post in our Bitbucket Repository.

Data Sources

Dataset | Source
County Presidential Election Results | MIT Election Data and Science Lab
Electoral College Results | US Federal Government National Archives
COVID-19 Data | Queried from our COVID-19 Dashboard database, which gets its data from Johns Hopkins University

Conclusion

In today’s information age, it’s shockingly easy to spread disinformation. Maps are one of the easiest, subtlest, and most effective ways to spread it. The double-barreled combination of the 2020 Election and the COVID-19 pandemic hit the United States with a tsunami of stupidity that has proven time and time again to have deadly consequences.

Thanks to data gurus around the world, disinformation is being called out more than ever before. Armed with the proper knowledge and logic, you can easily recognize, call out, and disprove disinformation. Today, I ask you for one small favor. Reach out to your favorite data guru, and express your appreciation for their work. Follow them on social media, donate some money to their cause, or simply thank them for their efforts. It’s a small gesture that can make a big impact both in your world and theirs.

Top Photo: The Snow-Capped Sierra Nevada Provide a Stunning Backdrop to a Beautiful Winter Day at Lake Tahoe
Glenbrook, Nevada – February, 2020

The post 13 Stunning Examples Showing How Easy It Is to Spread Disinformation without Manipulating Any Data appeared first on Matthew Gove Blog.

]]>
Python Tutorial: How to Create a Choropleth Map Using Region Mapping https://blog.matthewgove.com/2021/07/23/python-tutorial-how-to-create-a-choropleth-map-using-region-mapping/ Fri, 23 Jul 2021 16:00:00 +0000 https://blog.matthewgove.com/?p=2567 Several weeks ago, you learned how to create stunning maps without a GIS program. You created a map of a hurricane’s cone of uncertainty using Python’s GeoPandas library and an ESRI Shapefile. Then you created a map of major tornadoes to strike various parts of the United States during the […]

The post Python Tutorial: How to Create a Choropleth Map Using Region Mapping appeared first on Matthew Gove Blog.

]]>
Several weeks ago, you learned how to create stunning maps without a GIS program. You created a map of a hurricane’s cone of uncertainty using Python’s GeoPandas library and an ESRI Shapefile. Then you created a map of major tornadoes to strike various parts of the United States during the 2011 tornado season. You also generated two bar charts directly from the shapefile to analyze the number of tornadoes that occurred in each state that year. However, we did not cover one popular type of map: the choropleth map.

2011 tornado paths across the southeastern United States, created with Python GeoPandas.
2011 Tornado Tracks Across Dixie Alley

Today, we’re going to take our analysis to the next level. You’ll be given a table of COVID-19 data for each US State in CSV format for a single day during the COVID-19 pandemic. The CSV file has the state abbreviations, but does not include any geometry. Instead, you’ll be given a GeoJSON file that contains the state boundaries. You’ll link the data to the state boundaries through a process called region mapping and create a choropleth map of the data.

Why Do We Use Region Mapping to Create Choropleth Maps?

The main reason we use region mapping is for performance. When you use region mapping, you only need to load your geometry once, regardless of how many data points use that geometry. Each data point uses a unique identifier to “map” it to the geometry. You can use the ISO state or country codes, or you can make your own IDs. Without region mapping, you need to load the geometry for each data point that uses it.

To show you the performance gains, let’s use COVID-19 data as an example. In our COVID-19 Dashboard’s Map, you can plot data by state for several countries. For Canada, the GeoJSON file that contains the provincial boundaries is 150 MB. We’re roughly 500 days into the COVID-19 pandemic. A quick back-of-the-envelope calculation shows just how much data you’d need to load without region mapping.

data_load_size = size_of_geojson * number_of_days
data_load_size = (150 MB) * (500 days)
data_load_size = 75,000 MB = 75 GB 

Keep in mind, that 75 GB is just for the provincial boundaries. It does not include any of the COVID-19 data. And it only grows bigger and bigger every day.

Region Mapping helps us efficiently load data into our COVID-19 dashboard.
Region Mapping and Vector Tiles Allow Us to Load Canada’s Provincial Boundaries into our COVID-19 Map using Less Than 2 MB of Data.

Using region mapping, you only need to load the provincial boundaries once. With the GeoJSON file, that’s only 150 MB. In our COVID-19 map, we actually take it a step further. Instead of GeoJSON format, we use Mapbox Vector Tiles (MVT), which is much more efficient for online maps. The MVT geometry for the Canadian provincial boundaries is only 2 MB. Compared to possibly 75 GB of geometry data, 2 MB wins hands down.
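The savings are easy to verify with the numbers above:

```python
geojson_size_mb = 150   # Canadian provincial boundaries as GeoJSON
days = 500              # days into the pandemic
mvt_size_mb = 2         # same boundaries as Mapbox Vector Tiles, loaded once

# Without region mapping, the boundaries load once per day of data
without_region_mapping = geojson_size_mb * days   # 75,000 MB = 75 GB
savings = 1 - mvt_size_mb / without_region_mapping
print("{:.3%} reduction".format(savings))
```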

What is a Choropleth Map?

A choropleth map displays statistical data on a map using shading patterns on predetermined geographical areas. Those geographic areas are almost always political boundaries, such as country, state, or county borders. They work great for representing variability of a given measurement across a region.

Choropleth Map of Worldwide COVID-19 data
A Sample Choropleth Map Showing New Daily Worldwide COVID-19 Cases on 14 July, 2021

An Overview of Creating a Choropleth Map in Python GeoPandas

The process we’ll be programming in our Python script is breathtakingly simple using GeoPandas.

  1. Read in the US State Boundaries from the GeoJSON file.
  2. Import the COVID-19 data from the CSV file.
  3. Link the data to the state boundaries using the ISO 3166-2 code (state abbreviations)
  4. Plot the data on a choropleth map.

Required Python Dependencies

Before we get started, you’ll need to install four Python modules. You can easily install them using either anaconda or pip. If you have already installed them, you can skip this step.

  • geopandas
  • pandas
  • matplotlib
  • contextily

The first item in our Python script is to import those four dependencies.

import geopandas
import pandas
import matplotlib.pyplot as plt
import contextily as ctx

Define A Few Constants That We’ll Use Throughout Our Python Script

There are a few values we’ll use throughout the script. Let’s define a few constants so we can easily reference them.

GEOJSON_FILE = "USStates.geojson"
CSV_FILE = "usa-covid-20210102.csv"

# 3857 - Mercator Projection
XBOUNDS = (-1.42e7, -0.72e7)
YBOUNDS = (0.26e7, 0.66e7)

The XBOUNDS and YBOUNDS constants define the bounding box for the map, in the x and y coordinates of the Mercator projection, which we’ll be using in this tutorial. They are not in latitude and longitude. We’ve set them so the left edge of the map is just off the west coast (~127°W) and the right edge is just off the east coast (~65°W). Likewise, the top of the map is just above the US-Canada border (~51°N), and the bottom edge is far enough south (~23°N) to include the Florida peninsula and the Keys.

Read in the US State Boundaries Using GeoPandas

GeoPandas is smart enough to be able to automatically figure out the file format of most geometry files, including ESRI Shapefiles and GeoJSON files. As a result, we can load the GeoJSON the exact same way as we loaded the ESRI Shapefiles in previous tutorials.

geojson = geopandas.read_file(GEOJSON_FILE)

Read in Data From the CSV File

You may have noticed that we did not import Python’s built-in csv module. That was done intentionally. Instead, we’ll use Pandas to read the CSV.

On the surface, it may look like the main benefit is that you only need a single line of code to read in the CSV data with Pandas. After all, it takes a block of code to do the same with Python’s standard csv library. However, you’ll really reap the benefits in the next step when we go to map the data to the state boundaries.

data = pandas.read_csv(CSV_FILE)
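For contrast, reading the same data with Python’s standard csv module takes a block of code rather than one line (the sample rows below are hypothetical stand-ins for the file’s contents):

```python
import csv
import io

# Stand-in for the CSV file's contents
sample = "iso3166_2,new_cases\nUS-CA,30000\nUS-FL,10000\n"

rows = []
with io.StringIO(sample) as f:
    for row in csv.DictReader(f):
        rows.append(row)

# Note: every value comes back as a string, unlike pandas.read_csv()
```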

Map the CSV Data to the State Boundaries in the GeoJSON File

When you read the GeoJSON file in with the geopandas.read_file() method, Python stores it as a GeoPandas GeoDataFrame, which is a subclass of the Pandas DataFrame. If you were to read in the CSV data using Python’s built-in csv library, Python would store the data as a csv.reader object.

Here’s where the magic happens. By reading in the CSV data with Pandas instead of the built-in csv library, Python also stores the CSV data as a Pandas DataFrame object. If we had used Python’s built-in csv library, mapping the CSV data to the state boundaries would be like trying to combine two recipes, where one was in imperial units, and the other was in metric units.

The Pandas developers built the DataFrame objects to be easily split, merged, and manipulated, which means that once again, we can do it with just a single line of code.

full_dataset = geojson.merge(data, left_on="STATE_ID", right_on="iso3166_2")

Let’s go over what that line of code means.

  • geojson.merge(data, ... ): Merge the CSV data stored in the data variable into the US State boundaries stored in the geojson variable.
  • left_on="STATE_ID": The property that contains the common unique identifier in the GeoJSON file is called STATE_ID.
  • right_on="iso3166_2": The property (column) that contains the corresponding unique identifier in the CSV data is called iso3166_2.
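You can see the same merge at work on two tiny plain DataFrames (the column names match the tutorial, but the values are made up):

```python
import pandas as pd

# Stand-ins for the GeoJSON boundaries and the CSV data
boundaries = pd.DataFrame({
    "STATE_ID": ["US-CA", "US-FL"],
    "name": ["California", "Florida"],
})
covid = pd.DataFrame({
    "iso3166_2": ["US-FL", "US-CA"],
    "new_cases": [10000, 30000],
})

# Rows pair up wherever STATE_ID == iso3166_2
merged = boundaries.merge(covid, left_on="STATE_ID", right_on="iso3166_2")
```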

The ISO 3166-2 Code: What’s in the Mapping Identifier?

In this tutorial, we’re using each state’s unique ISO 3166-2 code to map the CSV data to the state boundaries in the GeoJSON. So what exactly is an ISO 3166-2 code? It’s a unique code that contains the country code and a unique ID for each state. The International Organization for Standardization, or ISO, maintains a standardized set of codes that every country in the world uses.

In many countries, including the United States and Canada, the ISO 3166-2 codes use the same state and province abbreviations that their respective postal services use. As you’ll see in the table, though, not all countries do.

ISO 3166-2 Code | State/Province | Country
US-CA | California | United States
US-FL | Florida | United States
US-NY | New York | United States
US-TX | Texas | United States
CA-BC | British Columbia | Canada
CA-ON | Ontario | Canada
AU-NSW | New South Wales | Australia
AU-WA | Western Australia | Australia
ZA-MP | Mpumalanga | South Africa
IT-BO | Bologna | Italy
RU-CHE | Chelyabinskaya Oblast | Russia
IN-MH | Maharashtra | India
TH-50 | Chiang Mai | Thailand
JP-34 | Hiroshima | Japan
FR-13 | Bouches-du-Rhône | France
AR-X | Córdoba | Argentina
KG-C | Chuy | Kyrgyzstan
Sample ISO 3166-2 Codes from Various Countries

Write a Function to Generate a Choropleth Map

Once the CSV data has been successfully linked to the state boundaries in the GeoJSON, everything is stored in a single Pandas DataFrame object. As a result, the code to plot the data will be nearly identical to the maps we created in previous GeoPandas tutorials.

Like the tornado track tutorial, you’ll be creating several different maps. To avoid running afoul of the DRY (Don’t Repeat Yourself) principle, let’s put the plotting code into a function that we can call.

First, let’s define the function. We’ll pass it three parameters: the merged dataset, the name of the column to plot, and a descriptive plot type for the map’s title.

def choropleth_map(mapped_dataset, column, plot_type):

Initialize the Figure

Inside that function, let’s first initialize the figure that will hold our choropleth map.

ax = mapped_dataset.plot(figsize=(12,6), column=column, alpha=0.75, legend=True, cmap="YlGnBu", edgecolor="k")

There’s a lot in this step, so let’s unpack it.

  • figsize=(12,6): Plot should be 12 inches wide by 6 inches tall
  • column=column: Plot the column name that was passed to the choropleth_map() function.
  • alpha=0.75: Make the map 75% opaque (25% transparent) so you can see through it slightly.
  • legend=True: Include the color bar legend on the figure
  • cmap="YlGnBu": Use a Yellow-Green-Blue color map
  • edgecolor="k": Color the state outlines/borders black

Remove Axis Ticks and Labels From Your Choropleth Map

If we use the standard WGS-84 (EPSG:4326) projection to plot the continental US, the map comes out short and wide. For a better aspect ratio, we’ll convert the data into the Mercator Projection (EPSG:3857), which in GeoPandas is a single call: mapped_dataset = mapped_dataset.to_crs(epsg=3857). Unfortunately, that means the x and y axes will no longer be in latitude and longitude, and will instead be in the coordinates of the Mercator Projection. To avoid any confusion, let’s just hide the labels on the x and y axes.

ax.set_xticks([])
ax.set_yticks([])

Set the Title of Your Choropleth Map

Next, we’ll set the title, exactly like we’ve done in previous tutorials.

title = "COVID-19: {} in the United States\n2 January, 2021".format(plot_type)
ax.set_title(title)

Because we’re only working with a specific date, we’ve hard-coded the date into the function. However, if you’re working with multiple dates, you can easily update the code so that the correct dates display on the maps.
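One way to sketch that update is to build the title from a date object instead of a hard-coded string (the helper name below is mine, not from the tutorial):

```python
from datetime import date

def map_title(plot_type, d):
    # Format the day without a leading zero, portably across platforms
    return "COVID-19: {} in the United States\n{} {}, {}".format(
        plot_type, d.day, d.strftime("%B"), d.year)

title = map_title("New Daily Cases", date(2021, 1, 2))
```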

Zoom the Map to Show the Continental United States

Now, let’s set the bounding box to show only the Lower 48.

ax.set_xlim(*XBOUNDS)
ax.set_ylim(*YBOUNDS)

Add the Basemap For Your Choropleth Map

Penultimately, add the basemap for the choropleth map. We’ll use the same Stamen TonerLite basemap that we used in both the Hurricane Dorian Cone of Uncertainty and the maps of the 2011 tornado tracks. We’ll get the projection from the dataset so we don’t have to worry about the basemap and the data being in different projections.

ctx.add_basemap(ax, crs=full_dataset.crs.to_string(), source=ctx.providers.Stamen.TonerLite, zoom=4)

Save Your Choropleth Map to a png File

Finally, save the plots to a png image file.

output_path = "covid19_{}_usa.png".format(column)
plt.savefig(output_path)

Let’s Generate 4 Choropleth Maps

Now that we have our function to generate the choropleth maps, let’s make 4 maps of COVID-19 data on 2 January, 2021, which was the peak of the winter wave in the United States.

  • New Daily Cases
  • Total Cumulative Cases
  • New Daily Deaths
  • Total Cumulative Deaths

columns_to_plot = [
    "new_cases", 
    "confirmed",
    "new_deaths",
    "dead"
]

plot_types = [
    "New Daily Cases",
    "Total Cumulative Cases",
    "New Daily Deaths",
    "Total Cumulative Deaths",
]

for column, plot_type in zip(columns_to_plot, plot_types):
    choropleth_map(full_dataset, column, plot_type)
    print("Successfully Generated Choropleth Map for {}...".format(plot_type))

Let There Be Maps

After running the script, you’ll find 4 choropleth maps in the script directory.

Download the Script and Run It Yourself

We encourage you to download the script from our Bitbucket Repository and run it yourself. Play around with it and see what other kinds of choropleth maps you can come up with.

Conclusion

Region mapping is an incredibly powerful way to efficiently display massive amounts of data on a map. For example, when we load the Canadian provincial data in our COVID-19 map, the combination of region mapping plus the Mapbox Vector tiles has resulted in a 99.997% reduction in the size of the provincial boundary being loaded. These savings are critical to the success of online GIS projects. Nobody in their right mind is going to sit around and wait for 75 GB of state boundaries to download every time the map loads.

Many people think that high-level tasks such as region mapping are confined to tools like ESRI ArcGIS. While Python GeoPandas is certainly not a replacement for a tool like ArcGIS, it’s a perfect solution for organizations that don’t have the budget for expensive software licenses or don’t do enough GIS work to require those licenses. If you’re ready, we can help you get started building maps with GeoPandas today.

If you’re ready to try a few exercises yourself, we’ve got a couple challenges for you.

Next Steps, Challenge 1:

Revisit our tutorial plotting 2011 tornado data. Revise that script so that instead of generating a map of the tornado tracks, you create a choropleth map of the number of tornadoes to strike each state in 2011. I’ll give you a hint to get started. You don’t need to use region mapping for this because the data is already embedded in the shapefile.

Next Steps, Challenge 2:

In the Bitbucket Repository, you’ll find a CSV File of COVID-19 data for each country for 2 January, 2021. Go online and find a GeoJSON or ESRI Shapefile of world country borders. Then use region mapping to create the same 4 choropleth maps we generated in this tutorial, except you should output a map of the world countries, not a map of US States. I’ve included all of the ISO Country Codes in the CSV file so you can use the Alpha 2, Alpha 3, or Numeric codes.

Top Photo: Beautiful Geology in Red Rock Country
Sedona, Arizona – August, 2016

The post Python Tutorial: How to Create a Choropleth Map Using Region Mapping appeared first on Matthew Gove Blog.

]]>
The Ultimate in Python Data Processing: How to Create Maps and Graphs from a Single Shapefile https://blog.matthewgove.com/2021/06/18/the-ultimate-in-python-data-processing-how-to-create-maps-and-graphs-from-a-single-shapefile/ Fri, 18 Jun 2021 16:00:00 +0000 https://blog.matthewgove.com/?p=2420 Last week, using Python GeoPandas, we generated two simple geographic maps from an ESRI Shapefile. After plotting a simple map of the 32 Mexican States, we then layered several shapefiles together to make a map of the Cone of Uncertainty for Hurricane Dorian as she brushed the coast of Florida, […]

The post The Ultimate in Python Data Processing: How to Create Maps and Graphs from a Single Shapefile appeared first on Matthew Gove Blog.

]]>
Last week, using Python GeoPandas, we generated two simple geographic maps from an ESRI Shapefile. After plotting a simple map of the 32 Mexican States, we then layered several shapefiles together to make a map of the Cone of Uncertainty for Hurricane Dorian as she brushed the coast of Florida, Georgia, and the Carolinas.

Today, we’re going to get a bit more advanced. We’re going to look at filtering and color data based on different criteria. Then, we’ll make some bar charts from the data in the shapefile. And we’ll do it all without using a GIS program. Once again, we’ll be looking at weather data. This time, we’ll be mapping tornado data from one of the busiest and deadliest tornado seasons ever.

A Review of the 2011 Tornado Season

We’re going to look at 2011 because it was one of the busiest and also one of the deadliest tornado seasons on record. In addition to the Super Outbreak across Dixie Alley on 27 April, there were two violent EF-5 tornadoes within 48 hours of each other in late May.

One struck Joplin, Missouri on 22 May, tragically killing 168 people. The other struck Piedmont, Oklahoma on 24 May. That tornado hit the El Reno Mesonet Site, which recorded a wind gust of 151 mph prior to going offline. To this day, that record stands as the strongest wind gust the Oklahoma Mesonet has ever recorded.

Significant tornado damage in Moore, Oklahoma in May, 2013
Tornado Damage in Moore, Oklahoma Less Than 10 Days After the Horrific EF-5 Tornado on 20 May, 2013

Tornado Track Data

The National Weather Service’s Storm Prediction Center in Norman, Oklahoma handles everything related to tornadoes and severe weather in the United States. In addition to its forecasting and research operations, it maintains an extensive archive of outlooks, storm reports, and more.

In that archive, you can find shapefiles of all tornado, wind, and hail reports since 1950. We’ll use the “Paths” shapefile for tornado data, but you can repeat the exercise with wind and hail data, too. From that file alone, we will generate:

  • Map of 2011 tornado tracks in the Eastern and Central United States, colored by strength
  • Zoomed in Map of May, 2011 tornado tracks for Oklahoma, Kansas, Missouri, and Arkansas, colored by strength
  • Zoomed in map of the 27 April, 2011 Super Outbreak across Mississippi and Alabama, colored by strength

Python Libraries and Dependables

For this exercise, you’ll need to make sure you’ve installed several Python libraries. You can easily install the libraries with pip or anaconda.

  • GeoPandas
  • Pandas
  • Matplotlib
  • Contextily

Digging into the Raw Data

Before we begin, we need to figure out exactly which parameters are in the Shapefile. Because we don’t have a GIS program, we can use GeoPandas. To list all columns, run the following Python code.

shp_path = "1950-2018-torn-aspath/1950-2018-torn-aspath.shp"
all_tornadoes = geopandas.read_file(shp_path)
print(all_tornadoes.columns)

This outputs a list of all column names in the shapefile. Because we are filtering by year, and coloring by strength, we are only interested in those columns.

  • yr is the year
  • st defines the 2-letter state abbreviation
  • mag represents the tornado strength, using the Enhanced Fujita Scale

Read in the Data From the Shapefile

Before we dive into the Python code, let’s first import all of the libraries that we’ll need to make our geographic maps and graphs.

import geopandas
import pandas as pd
import matplotlib.pyplot as plt
import contextily as ctx

We’ll import the data the same way we did last week.

shp_path = "1950-2018-torn-aspath/1950-2018-torn-aspath.shp"
all_tornadoes = geopandas.read_file(shp_path)

Filtering the Raw Data

Applying a filter to the raw data is incredibly simple. All you need to do is define the filtering criteria, and then apply it. You can do it in a single line of code. However, to make it easier to understand, we’ll do it in two.

Recall our filtering criteria. We’re using data from 2011.

filter_criteria = (all_tornadoes["yr"] == 2011)

Applying the filter is as simple as parsing any other list.

filtered_data = all_tornadoes[filter_criteria]
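For reference, here is the same filter as a single line, demonstrated on a plain Pandas DataFrame with made-up rows standing in for the shapefile attributes:

```python
import pandas as pd

# Hypothetical stand-in for the tornado shapefile's attribute table
df = pd.DataFrame({"yr": [2010, 2011, 2011], "mag": [2, 5, 4]})

# Define and apply the filter in one step
tornadoes_2011 = df[df["yr"] == 2011]
```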

Finally, don’t forget to convert it to the WGS-84 projection. If you’ve forgotten it, the EPSG code for WGS-84 is 4326.

filtered_data = filtered_data.to_crs(epsg=4326)

Plotting the Data on a Map

Using Python, we’re going to create three geographic maps of the 2011 tornado paths with GeoPandas.

  1. The eastern 2/3 of the United States
  2. Southern Great Plains (Oklahoma, Kansas, Missouri, and Arkansas)
  3. Dixie Alley (Mississippi, Alabama, Georgia, and Tennessee)

Because we’re generating three geographic maps, let’s put the map generation code into a Python function. We’ll call it plot_map, and we’ll pass it 4 arguments.

  • data: A list of the data we filtered from the shapefile in the previous section
  • region: One of the three regions defined above. We’ll define them as follows.
    • United States
    • Tornado Alley
    • Dixie Alley
  • xlim: An optional list or tuple defining the minimum and maximum longitudes to plot, in the format [min, max]. If omitted, the map will be scaled to fit the entire dataset.
  • ylim: An optional list or tuple defining the minimum and maximum latitudes to plot, in the format [min, max]. If omitted, the map will be scaled to fit the entire dataset.

In Python, we’ll create our geographic maps with a function called plot_map().

def plot_map(data, region, xlim=None, ylim=None):
    # Put Your Code Here

Extra Arguments Needed to Generate the Plot

We’ll use the same methods we used last week to plot the data on a map. First, we’ll plot the data, passing it a few extra parameters compared to last time.

  • column: Tells Python which column in the shapefile to plot.
    • We’ll use the mag column for tornado magnitude.
  • legend: A boolean argument that tells Python whether or not to display the legend on the map.
    • We will include the legend (legend = True)
  • cmap: The color map to use to shade the data on the map.
    • We’ll use the “cool” color map, which will shade the tornado paths from blue (weakest) to pink (strongest).

Put it all together into a single line of code.

ax = data.plot(figsize=(12,6), column="mag", legend=True, cmap="cool")

Zoom In On Certain Areas

To zoom in on our three areas, we’ll need to set the bounding box. All a bounding box does is define the minimum and maximum latitudes and longitudes to include on the map. If the xlim and ylim arguments are passed to the function, let’s use them to set the bounding box.

if xlim:
    ax.set_xlim(*xlim)
if ylim:
    ax.set_ylim(*ylim)

Next, we’ll set the map’s title using the region variable we passed to the function.

title = "2011 Tornado Map: {}".format(region)
ax.set_title(title)

While we’re looking at the region, let’s also use the region to set the output filename for our map. We’ll replace spaces with dashes and make everything lowercase.

fname_region = region.replace(" ", "-").lower()

The final piece of the map to add is the basemap, using the same arguments as the Hurricane Dorian cone of uncertainty. If you need to jog your memory, here are those parameters again.

  • ax: The plotted data
  • crs: The coordinate reference system, or projection
  • source: The basemap style (we’ll use the Stamen TonerLite style again)

ctx.add_basemap(ax, crs=data.crs.to_string(), source=ctx.providers.Stamen.TonerLite)

Finally, just save the map in png format. Incorporate the fname_region variable we defined above to give each file a unique name.

figname = "2011-tornadoes-{}.png".format(fname_region)
plt.savefig(figname)

Create Bar Charts

To further demonstrate the incredible power of Python and GeoPandas, let’s make a few bar graphs using data we’ve parsed from the shapefile. While top-tier GIS programs such as ESRI ArcGIS include graphing capabilities, most GIS applications do not. After this, you’ll be able to create publication-ready bar charts from shapefile data in less than 50 lines of code. And it didn’t cost you a dime.

Let’s create two bar charts.

  • One with the 10 states that recorded the most tornadoes in 2011
  • The other with the 10 states that recorded the fewest tornadoes in 2011

Count the Number of Tornadoes that Struck Each State in 2011

Because we’re only looking for raw tornado counts, all we have to do is loop through the state (st) column in the shapefile and count how many times each state appears. First, let’s initialize a dictionary to store our state counts. The dictionary keys will be the 2-letter state abbreviations. For example, to get the number of tornadoes in Kansas, you would simply call state_counts["KS"].

states = filtered_data["st"]
state_counts = dict()

Next, all you have to do is count. Do note that the shapefile data contains data for Puerto Rico and the District of Columbia, so we will skip those because they are not states. Then, if a state is already in the state_counts dictionary, just add one to its count. If it’s not yet in the state_counts dictionary, initialize the count for that state, and set it to 1.

for state in states:
    # Filter Out DC and Puerto Rico
    if state in ["DC", "PR"]:
        continue

    # Count the Data
    if state in state_counts.keys():
        state_counts[state] += 1
    else:
        state_counts[state] = 1
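The loop above can also be collapsed with collections.Counter, shown here on a hypothetical sample of the st column:

```python
from collections import Counter

# Hypothetical sample of the shapefile's "st" column
states = ["KS", "OK", "KS", "DC", "TX", "KS"]

# Tally every state, skipping DC and Puerto Rico
state_counts = Counter(s for s in states if s not in ("DC", "PR"))
```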

Sort the Counts in Ascending and Descending Order

Sorting data with Python is easy thanks to the sorted() function. You’ll need to pass the function several arguments.

  • The raw data to sort. In our case, we’re using the state_counts dictionary.
  • Which parameter to base the sorting on
  • The sort function sorts from least to greatest by default, so if we’re sorting from greatest to least, we’ll need to tell Python to reverse the sort order.

fewest_counts = sorted(state_counts.items(), key=lambda x: x[1])[:10]
most_counts = sorted(state_counts.items(), key=lambda x: x[1], reverse=True)[:10]

I know this may look a little confusing, so let’s break down what everything means.

  • state_counts.items(): A list of key, value tuples of data in the dictionary to sort. For the state_counts dictionary, the tuples would be (state_abbreviation, number_of_2011_tornadoes).
  • key=lambda x: x[1]: The key argument tells Python on which parameter to sort the data. lambda x: x[1] tells Python to sort based on the second element of each tuple in state_counts.items(), which is the number of tornadoes that occurred in each state in 2011.
  • reverse=True: The graph of states with the most tornadoes in 2011 sorts the counts from greatest to smallest. Because sorted() sorts from smallest to greatest by default, reverse=True simply tells Python to reverse the order of the sort.
  • [:10]: Instructs Python to only take the first 10 items of each sorted dataset.
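To see those pieces in action, here's a tiny standalone demo; the counts below are made up purely for illustration.

```python
# Hypothetical tornado counts for illustration
state_counts = {"TX": 150, "KS": 80, "AK": 2, "OK": 119}

# Ascending sort on the count (second tuple element)
fewest = sorted(state_counts.items(), key=lambda x: x[1])

# Descending sort, via reverse=True
most = sorted(state_counts.items(), key=lambda x: x[1], reverse=True)

print(fewest[0])  # ('AK', 2)
print(most[0])    # ('TX', 150)
```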

A Function to Generate the Bar Graphs

The most powerful aspect of the GeoPandas library is that it comes with all of the Pandas Data Analysis tools already built into it. As a result, you can perform your entire data analysis from within GeoPandas, regardless of whether or not each piece has a GIS component to it. We’ll use the Pandas library to create our bar graphs. Because we’re creating multiple plots, let’s create a function to generate each plot.

def plot_top10(sorted_list, title):
    # Code Goes Here

Our function requires two arguments.

  • sorted_list is either the fewest_counts or most_counts variable we defined in the previous section
  • title is the title that goes at the top of the graph

Python’s Data Analysis Libraries Use Parallel Arrays to Plot (x, y) Pairs

Both pandas and matplotlib make heavy use of parallel arrays to define (x, y) coordinates to plot. Converting our (state_abbreviation, number_of_2011_tornadoes) tuples into parallel arrays is easy. We’ll loop through each tuple in the sorted data, put the states into one array, and put the tornado counts into the other.

states = []
num_tornadoes = []

for pair in sorted_list:
    state = pair[0]
    count = pair[1]

    states.append(state)
    num_tornadoes.append(count)
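If you prefer a more compact idiom, zip(*...) unpacks a list of tuples into the same parallel arrays in a single line. Here's a sketch with hypothetical (state, count) pairs:

```python
# Hypothetical sorted (state, count) pairs
sorted_list = [("TX", 150), ("OK", 119), ("KS", 80)]

# zip(*...) transposes the list of pairs into two parallel sequences
states, num_tornadoes = zip(*sorted_list)

print(list(states))         # ['TX', 'OK', 'KS']
print(list(num_tornadoes))  # [150, 119, 80]
```

Note that zip() returns tuples rather than lists, which is fine for plotting purposes.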

Pandas expects the data to be passed to it in a dictionary. The dictionary keys become the labels that go on each axis. Add a key/value pair to the dictionary for each parameter you'll be analyzing.

data_frame = {
    "Number of Tornadoes": num_tornadoes,
    "States": states,
}

Next, read the data into Pandas. The index parameter tells Pandas which variable to plot on the independent (x) axis. We want to plot the states variable on the x-axis.

df = pandas.DataFrame(data_frame, index=states)

Then, create the bar chart and label the title and axes. The rot parameter tells Pandas how far to rotate the x-axis tick labels, which is handy for long labels. Our state abbreviations are short enough to read horizontally, so we'll set it to zero.

ax = df.plot.bar(rot=0, title=title)
ax.set_xlabel("State")
ax.set_ylabel("Number of Tornadoes")

Finally, save the graph to your hard drive in png format.

figname = "{}.png".format(title)
plt.savefig(figname)
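Assembled into one place, the whole function might look like the sketch below. It assumes the pandas and matplotlib imports shown, and sets matplotlib's headless Agg backend so it also runs on machines without a display.

```python
import matplotlib
matplotlib.use("Agg")  # render to files; no display needed
import matplotlib.pyplot as plt
import pandas

def plot_top10(sorted_list, title):
    # Unpack the (state, count) tuples into parallel arrays
    states = [pair[0] for pair in sorted_list]
    num_tornadoes = [pair[1] for pair in sorted_list]

    # Build the DataFrame, with the states as the x-axis index
    df = pandas.DataFrame({"Number of Tornadoes": num_tornadoes}, index=states)

    # Draw a vertical bar chart and label it
    ax = df.plot.bar(rot=0, title=title)
    ax.set_xlabel("State")
    ax.set_ylabel("Number of Tornadoes")

    # Save the figure to disk and free its memory
    figname = "{}.png".format(title)
    plt.savefig(figname)
    plt.close()
    return figname

# Hypothetical data for a quick smoke test
plot_top10([("TX", 150), ("OK", 119)], "Sample Bar Chart")
```

One small design note: because the index already labels the x-axis, the separate "States" column from above isn't strictly needed inside the DataFrame, so this sketch leaves it out.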

Generate the Maps and Bar Graphs

Now that we have defined our Python functions to generate the geographic maps and bar graphs, all we have to do is call them. Before we do that, let’s recall the parameters we have to pass to each function.

The plot_map() function to generate the map takes 4 arguments (in the first call below, the ylim argument is omitted and falls back to its default). If you can't remember what each argument is, we defined them above.

plot_map(data, region, xlim, ylim)

And for the plot_top10 function to create the bar graphs, we need 2 arguments.

plot_top10(sorted_list, title)

Now, just call the functions. First we’ll do the maps.

# East-Central US
plot_map(filtered_data, "United States", (-110, -70))

# Oklahoma - Kansas - Missouri - Arkansas
plot_map(filtered_data, "Tornado Alley", (-100, -91), (33, 38))

# Dixie Alley
plot_map(filtered_data, "Dixie Alley Super Outbreak", (-95, -81), (29, 37))

And the bar charts…

# Highest Number of Tornadoes
plot_top10(most_counts, "US States with the Most Tornadoes in 2011")

# Fewest Number of Tornadoes
plot_top10(fewest_counts, "US States with the Fewest Tornadoes in 2011")

When you run the scripts, you should get the following output. First, the three maps. You can click on the image to view it full size.

And here are the two bar charts.

Try It Yourself: Download and Run the Python Scripts

As with all of our Python tutorials, feel free to download the scripts from our Bitbucket Repository and run them yourself.

Additional Exercises

If you’re up for an extra challenge, modify the script to plot any or all of the following, or create your own.

  • 2011 Hail and Wind Reports
  • Plot only significant (EF-3 and above) tornadoes
  • Display the May, 2013 tornado outbreaks (Moore and El Reno) in Central Oklahoma
  • Create a map of the 1974 Super Outbreak across the Midwest and southern United States
  • Re-create the same maps and graphs for Canada. You can download Canadian tornado track data from the Environment Canada archives.

Conclusion

If you only need to generate static geographic maps, Python GeoPandas is one of the most powerful tools you can use. It lets you plot nearly any type of geospatial data on a publication-ready map. Even better, it comes with one of the most complete and dynamic data analysis libraries that exists today.

While it’s certainly not a replacement for a full GIS suite like ESRI’s ArcGIS, I highly recommend GeoPandas if you want to avoid expensive licensing fees or have heavy data processing needs that some GIS applications can struggle with.

If you’d like to get started with GeoPandas (or any other GIS application), get in touch today to talk to our GIS experts about your geospatial data analysis. Once you get started with GeoPandas, you’ll be impressed with what you can do with it.

Top Photo: A Wedge Tornado Forms Over an Open Prairie
Chickasha, Oklahoma – May, 2013

The post The Ultimate in Python Data Processing: How to Create Maps and Graphs from a Single Shapefile appeared first on Matthew Gove Blog.

Python GeoPandas: Easily Create Stunning Maps without a GIS Application https://blog.matthewgove.com/2021/06/11/python-geopandas-easily-create-stunning-maps-without-a-gis-program/ Fri, 11 Jun 2021 16:00:00 +0000 https://blog.matthewgove.com/?p=2415

Python is the world’s third most popular programming language. It’s also one of the most versatile languages available today. Not surprisingly, Python has incredible potential in the field of Geographic Information Systems (GIS). That potential has only barely begun to get tapped with libraries like GeoPandas.

In the past, we’ve looked at many different uses for Python, including how to make basic GIS maps. Unfortunately, the Basemap module is quite limiting on its own. My biggest complaint about it is actually how the maps look. It’s far too easy to make low-quality maps that look like they’re stuck in the 1980’s.

Enter Python’s GeoPandas Project

The Python Pandas library is an incredibly powerful data processing tool. Built on the popular numpy library and using matplotlib for plotting, it's a sleek combination of power, speed, and efficiency. You can easily perform complex data analysis in just a fraction of the time you could with Microsoft Excel or even raw Python.

GeoPandas is an add-on for Pandas that lets you plot geospatial data on a map. In addition to all of the benefits of Pandas, you can create high-quality maps that look incredible in any publication and on any medium. It supports nearly all GIS file formats, including ESRI Shapefiles. And do you know what the best part is? You don’t even need a GIS program to do it.

Today, we’re going to learn how to use GeoPandas to easily make simple, but stunning GIS maps. We’ll generate each map with less than 40 lines of code. The Python code is easy to read and understand, even for a beginner.

Getting Started with GeoPandas

Before getting started, you’ll need to install GeoPandas using either pip or anaconda. You can find the installation instructions from their website below. Please note their warning that pip may not install all of the required dependencies. In that case, you’ll have to install them manually.

All right, let’s dive in to the fun stuff. As always, you can download the full scripts from the Bitbucket repository.

Exercise #1: Display an ESRI Shapefile on a Map

Before we do any kind of number crunching and data analysis, we need to make sure we can load, read, and plot a shapefile using GeoPandas. In this example, we’ll use a shapefile of Mexican State borders, but you can use any shapefile you desire.

First, let’s import the libraries and modules we’ll be using. In addition to GeoPandas, we’ll be using matplotlib, as well as contextily. The contextily library allows us to set a modern, detailed basemap.

import geopandas
import matplotlib.pyplot as plt
import contextily as ctx

Diving into the code, we’ll first read the shapefile into pandas.

shp_path = "mexico-states/mexstates.shp"
mexico = geopandas.read_file(shp_path)

A Word on Projections

If we were to plot the map right now, we’d run into a major issue. Do you have any guesses as to what that issue might be? The boundaries in the shapefile will not align with the boundaries on the basemap because they use different coordinate reference systems (CRS’s), or projections. In that event, you’ll wind up with a map like this.

The outline of US States lies projected over Canada when the basemap is in a different projection from the shapefile.
What happens when your shapefile is not in the same projection as your basemap.

While the basemap is in the Pseudo-Mercator projection, the shapefile uses the North American Datum of 1983 (NAD-83) coordinate reference system. The European Petroleum Survey Group (EPSG) maintains a standardized database of all coordinate reference systems. Thankfully, you only need one line of code to convert the shapefile to the Pseudo-Mercator projection. The EPSG code for Pseudo-Mercator is 3857.

mexico = mexico.to_crs(epsg=3857)

Plot the State Borders and Basemap

With both the basemap and the shapefile in the same coordinate reference system, we can plot them on the map. First, we’ll do the shapefile. We’ll pass the plot() method three parameters.

ax = mexico.plot(figsize=(12,8), alpha=0.5, edgecolor="k")

Now, let’s use the contextily library to add the basemap. You can set the zoom level of the basemap, but I prefer to let Geopandas figure it out automatically. If you zoom in too much, you can easily crash your Python script.

ctx.add_basemap(ax)

Finally, save the figure to your hard drive.

plt.savefig("mexico-state-borders.png")
Map of Mexican State borders, created with GeoPandas.

There’s still plenty of room for improvement, but our map is off to a great start!

Exercise #2: Use Layers to Map a Hurricane’s Cone of Uncertainty

Layers make GIS, graphic design, and much more incredibly powerful. In this example, let's stack three layers to generate a map of the cone of uncertainty of a hurricane. If we can generate the plot quickly, warnings and evacuation orders can be issued sooner, and lives can be saved. We'll look at Hurricane Dorian as it bore down on Florida and the Bahamas in 2019.

The National Hurricane Center GIS Portal

The National Hurricane Center maintains a portal of both live GIS data as well as archives dating back to 2008. While I included the Hurricane Dorian shapefiles in the Bitbucket repository, I encourage you to browse the NHC archives and run our script for other hurricanes. Their file naming system can be a bit cryptic, so you can look up advisory numbers in their graphics archive.

Using the Hurricane Dorian shapefile as an example, here’s how the filename breaks down. The filename is al052019-033_5day_pgn.shp

  • al: Atlantic Hurricane
  • 05: The fifth storm of the season
  • 2019: The 2019 calendar year
  • 033: NHC Advisory #33
  • 5day: 5-Day Forecast
  • pgn: The polygon of the cone of uncertainty.

The NHC also provides shapefiles for the center line and points of where the center of the hurricane is forecast to be at each subsequent advisory. We’ll use all three in this example.
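That naming convention is easy to decode programmatically. Here's a hypothetical helper (not part of any NHC library) that splits a filename like al052019-033_5day_pgn.shp into its components, following the breakdown above:

```python
def parse_nhc_filename(filename):
    """Decode an NHC shapefile name into its component parts."""
    stem = filename.rsplit(".", 1)[0]          # drop the .shp extension
    storm_id, layer_part = stem.split("-")     # "al052019", "033_5day_pgn"
    advisory, forecast, layer = layer_part.split("_")
    return {
        "basin": storm_id[:2],         # "al" = Atlantic
        "storm_number": storm_id[2:4], # fifth storm of the season
        "year": storm_id[4:],
        "advisory": advisory,
        "forecast": forecast,
        "layer": layer,                # pgn, lin, or pts
    }

print(parse_nhc_filename("al052019-033_5day_pgn.shp"))
```

A helper like this is handy if you ever loop over a whole directory of NHC archive downloads.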

Python Code

Like the Mexico map, let’s start by importing the modules and libraries we’ll be using.

import geopandas
import matplotlib.pyplot as plt
import contextily as ctx

There are three components of the cone of uncertainty: the polygon, the center line, and the points where the center of the storm is forecast to be at each subsequent advisory. Each has its own shapefile. We'll use a string substitution shortcut here so we don't have to retype the filename three times. The .format() method substitutes the parameter passed to it into the curly brackets in the filepath.

SHP_PATH = "shp/hurricane-dorian/al052019-033_5day_{}.shp"
polygon_path = SHP_PATH.format("pgn")
line_path = SHP_PATH.format("lin")
point_path = SHP_PATH.format("pts")
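Just to make the substitution concrete, here's what .format() produces for the polygon layer:

```python
SHP_PATH = "shp/hurricane-dorian/al052019-033_5day_{}.shp"

# The curly brackets are replaced by the layer suffix
polygon_path = SHP_PATH.format("pgn")

print(polygon_path)  # shp/hurricane-dorian/al052019-033_5day_pgn.shp
```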

Now, read the three shapefiles into GeoPandas.

polygons = geopandas.read_file(polygon_path)
lines = geopandas.read_file(line_path)
points = geopandas.read_file(point_path)

For the projections, we're going to change things up slightly from the Mexico map. Look at the x and y-axis labels of the Mexico map. They're in the units of the projection instead of latitude and longitude. Instead of using the Pseudo-Mercator projection, let's use the WGS-84 projection, which uses EPSG code 4326. WGS-84 uses latitude and longitude as its coordinate system, so the axis labels will be latitude and longitude.

polygons = polygons.to_crs(epsg=4326)
lines = lines.to_crs(epsg=4326)
points = points.to_crs(epsg=4326)

Layering the Shapefiles on a Single Map

Before plotting the shapefiles, think about how you may want to color them. Because we’re dealing with a Category 5 hurricane that’s an imminent threat to population centers, let’s shade the cone red. While we’re at it, let’s make the center line and points black so they stand out.

The facecolor parameter defines the color a polygon is shaded. We’ll also make the cone more transparent so you can see the basemap underneath it better. That way, there’s no doubt as to where the storm is heading.

To stack the layers on a single map, define a figure (fig) variable with the initial layer. Then reference that variable to tell GeoPandas to plot each subsequent layer on the same map (ax=fig).

fig = polygons.plot(figsize=(10,12), alpha=0.3, facecolor="r", edgecolor="k")
lines.plot(ax=fig, edgecolor="k")
points.plot(ax=fig, facecolor="k")

Since this map would be published to the public in the real world, let’s spruce it up with a title and axis labels so there are no doubts about our warnings and messaging.

plot_title = "Hurricane Dorian Advisory #33\n11 AM EDT      1 September, 2019"
fig.set_title(plot_title)
fig.set_xlabel("Longitude")
fig.set_ylabel("Latitude")

Correctly Project the Basemap into WGS-84

The last map layer is the basemap. Because the basemap is not in the WGS-84 projection by default, we'll need to pass that as well. To avoid typos, we'll reference it from one of the shapefiles that's already in the WGS-84 projection. We'll also manually set the zoom level to optimize the size and placement of state and city names on the basemap.

ctx.add_basemap(fig, crs=polygons.crs.to_string(), zoom=7)

Finally, save the figure to your hard drive.

plt.savefig("hurricane-dorian-cone-33.png")
GeoPandas map of Hurricane Dorian's cone of uncertainty, with colored basemap.

Plenty of Room for Improvement

That’s a perfectly fine map, but I believe we can do better. While the cone itself looks great, the basemap leaves a bit to be desired. City and state labels are hard to read, especially when they’re inside the cone. The basemap looks blurry, no matter the zoom level you set it too.

Additionally, the green terrain draws the eye away from the cone. In emergency situations like a major hurricane making landfall, you want to convey a bit of urgency. The red just doesn’t “pop” off the page.

So how do you fix the map to convey more urgency? Change the basemap. The terrain is too distracting. Ideally, when you look at the map, you should instantly be able to identify the map’s location using the basemap. After that, though, the basemap should “fade” into the background, allowing your reader to focus on the data. Using muted or greyscale colors on the basemap is the best way to accomplish that.

Thankfully, contextily provides a good selection of basemaps. Have a look at the basemaps in that link (at the bottom of the page). Can you identify any basemaps that use muted or greyscale colors? The one that catches my eye is called "Stamen Toner Lite".

Updating Our Python Code

Updating our script is easy. When you call the add_basemap() method, you can specify which basemap you use by passing it the source parameter.

ctx.add_basemap(fig, crs=polygons.crs.to_string(), source=ctx.providers.Stamen.TonerLite)

After running the script, the difference is striking. It’s amazing the difference just changing a few colors makes.

GeoPandas map of Hurricane Dorian's cone of uncertainty, with greyscale basemap.

So how did we do? Instantly identify the location on the map? Check. Eye instantly drawn to the red? Check. The red pops off the page? You bet. For a final comparison, here’s the actual graphic from the National Hurricane Center. I’ll let you decide which map you like best.

Official NHC Advisory for Hurricane Dorian – 11 AM EDT on 1 September, 2019

The Hurricane Center actually provides all of the GIS data and layers to recreate their official advisory graphics. Unfortunately, that’s outside the scope of this tutorial, but we’ll create those maps in a future lesson.

Pro Tip: Use this Python script in real time this hurricane season. You just need to change the shapefile path to the live URL on the National Hurricane Center Website.

Conclusion

It wasn’t too long ago that you needed expensive GIS software to make high-quality, publication-ready figures. Thankfully, those days are forever behind us. With web-based and open-source GIS platforms coming online, geospatial data processing is not only becoming much more affordable. It’s also gotten exponentially more powerful.

This tutorial doesn’t even begin to scratch the surface of what you can do with Python GeoPandas. As a result, we’ll be exploring the GeoPandas tool much more as we go though this summer and into the fall.

In our next tutorial, we'll be analyzing data from one of the busiest and deadliest tornado seasons in US history. Make sure you come back later this month when it drops. In the meantime, please sign up for our email newsletter to stay on top of industry news. And if you have any questions, please get in touch or leave a comment below. See you next time.

Top Photo: A Secluded Stretch of Beach on the Intracoastal Waterway
St. Petersburg, Florida – March, 2011

The post Python GeoPandas: Easily Create Stunning Maps without a GIS Application appeared first on Matthew Gove Blog.
