
First Set: Web Scraping and Geocoding

Steven Turnbull

August 30, 2024

Scraping the Barrel

There have been so many times that I’ve wanted to pull location information from a website or file containing a set of addresses. This can be done with many hours of manual copying and pasting, a task that may even be impossible depending on the scale of the data. We can save time (and sanity) by making our computers do the work for us. Websites are (usually) highly organised and structured, making them perfect targets for programmatic extraction. In this post, I’ll show you how.

First, we want to grab data from the website we are interested in. Here’s a screenshot of Tennis NZ’s website showing where to play. You can see it contains a map with tennis clubs highlighted, but the information isn’t immediately accessible. You could scroll through each entry on the left (>300 clubs) and copy and paste the information out, but that’s not fun.

Map of Tennis Clubs in Auckland

There are two ways we can pull this information. We can use the web-scraping package rvest to pull the HTML directly from the site with read_html(website_url). For our use case (and to ensure reproducibility), I’ll download a copy of the HTML instead: simply right click on the page and click ‘Save as’. We’ll then have the file Tennis NZ - Where to Play.html available to look at in more detail.
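If you’d rather scrape the live page, the equivalent call is a one-liner. A minimal sketch — the URL below is a placeholder, so substitute the actual Tennis NZ page:

```r
library(rvest)

# Hypothetical URL - replace with the real 'Where to Play' page
website_url <- "https://www.tennis.kiwi/where-to-play"

# read_html() fetches and parses the page in one step
html_content <- read_html(website_url)
```

One caveat: read_html() only sees the raw HTML the server sends, so listings rendered by JavaScript may not appear. Saving the page from your browser captures the fully rendered content, which is another reason I’ve worked from a downloaded copy here.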

The code below loads this .html file. It then uses tools from the rvest package to grab the specific bits of information we are interested in. By right clicking on a webpage and clicking ‘Inspect’, you can find the CSS classes of the parts of the page you’re interested in. In our case, we’ll be pulling the contents of the storepoint-name and storepoint-address elements.

Code
library(rvest)
library(dplyr)
library(stringr)
library(here)
# Read the HTML content from the saved file
html_content <- read_html(here("inputs","tennis","Tennis NZ - Where to Play.html"))

# Extract addresses from elements with the class 'storepoint-address' from the HTML content
storepoint_address_contents <- html_content |>
  # Select elements with the class 'storepoint-address'
  html_elements('.storepoint-address') |>
  # Extract text content from the selected elements and trim any surrounding whitespace
  html_text(trim = TRUE)

# Also get the tags of elements with the class 'storepoint-name' from the HTML content
tag_text_contents <- html_content |>
  # Select elements with the class 'storepoint-name'
  html_elements('.storepoint-name') |>
  # Extract text content from the selected elements and trim any surrounding whitespace
  html_text(trim = TRUE)

# Combine the extracted name and address data into a data frame
tennis_df <- data.frame(
  # Assign the extracted names to the 'name' column
  name = tag_text_contents,
  # Assign the extracted addresses to the 'storepoint_address' column
  storepoint_address = storepoint_address_contents,
  # Prevent strings from being converted to factors
  stringsAsFactors = FALSE
) 

As happens in 90% of cases, there are some data quality issues. For example, there are double commas with no content between them. Before we can do anything with this data, we will need to fix these issues (geocoding broken addresses is futile!).
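A quick way to eyeball the problems before writing any fixes is to flag the suspicious rows. A minimal sketch, using the tennis_df built above:

```r
library(dplyr)
library(stringr)

# Flag addresses with double commas, or a lowercase letter jammed
# against an uppercase one (a sign of a missing space)
tennis_df |>
  filter(
    str_detect(storepoint_address, ",,") |
    str_detect(storepoint_address, "[a-z][A-Z]")
  ) |>
  head()
```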

My code below fixes some of the issues I found in the scraped data. As an aside, regex is a game changer for cleaning data. As a Data Scientist, it’s very rare that I’ll have a day where I haven’t used some form of regex.

Code
tennis_df_clean <- tennis_df |>
  # Add a space between a lowercase letter followed by an uppercase letter
  mutate(storepoint_address = str_replace_all(storepoint_address, "([a-z])([A-Z])", "\\1 \\2")) |>
  # Add a space between a lowercase letter followed by a number
  mutate(storepoint_address = str_replace_all(storepoint_address, "([a-z])([0-9])", "\\1 \\2")) |>
  # Replace the abbreviation "Cnr" with "Corner"
  mutate(storepoint_address = str_replace_all(storepoint_address, "Cnr", "Corner")) |>
  # Remove any spaces that appear before commas
  mutate(storepoint_address = str_replace_all(storepoint_address, " ,", ",")) |>
  # Collapse any double commas into a single comma
  mutate(storepoint_address = str_replace_all(storepoint_address, ",,", ",")) |>
  # Collapse the repeated "Remuera Remuera Remuera" into a single "Remuera"
  mutate(storepoint_address = str_replace_all(storepoint_address, "Remuera Remuera Remuera", "Remuera"))

Now that we have our clean scraped data, we can focus on geocoding!

Geocode with one Simple Step

Web scraping is a simple skill that opens up a world of possibilities for geospatial analysis. With the set of addresses at hand, we can add an extra step to augment our newly pulled data with additional information. In our case, we were unfortunately unable to find any existing data on the longitude and latitude of tennis clubs. Luckily, the tidygeocoder package makes it incredibly straightforward to get this information.

Here’s a simple chunk of code that will go through a dataset of addresses and grab the longitude and latitude.

Code
library(tidygeocoder)

#This is where we will store our results
coord_main <- data.frame()

#Iterate through each row of the tennis data
for(i in 1:nrow(tennis_df_clean)){
  #We'll set our system to sleep before each run.
  #This is to be kind to Nominatim and not bombard them with requests
  Sys.sleep(1)

  #Grab the specific row of the data
  df_i<-tennis_df_clean[i,]
  
  # Use the geocode function to get latitude and longitude for the address in the 'storepoint_address' column
  coords_df <- df_i |> 
    # Geocode the addresses using the Nominatim (osm) method
    geocode(address = storepoint_address, method = 'osm') |>
    # Rename the columns 'lat' and 'long' to 'latitude' and 'longitude' respectively
    rename(latitude = lat, longitude = long)

  # Save the combined data frame to a CSV file.
  # Using append means it writes one row at a time, saving our work as we go.
  # This is useful for big batches of addresses: if the process breaks halfway
  # through, you're less likely to lose work (API calls can be costly).
  write.table(
    coords_df, # Data frame to write
    here("inputs","tennis","tennis_coords.csv"), # File to write to.  
    sep = ",", # Use a comma as the separator
    col.names = !file.exists(here("inputs","tennis","tennis_coords.csv")), # Include column names only if the file doesn't already exist
    append = TRUE, # Append the data to the file if it already exists
    row.names = FALSE # Do not write row names
  )

  # Combine the new data with the main coordinates data frame
  coord_main <- bind_rows(coord_main, coords_df)

  # Print a message to the console indicating the completion of processing for each address
  cat("\n", coords_df$name, " complete. (", coords_df$latitude, ", ", coords_df$longitude, ")")
}

And here’s the results!

Note that some addresses may not get a result; it depends on the quality of the addresses and the geocoding service you use. In our case, we know the data is patchy, and I’ve used OpenStreetMap’s Nominatim service to geocode. Other services can be found here.
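To see which addresses the geocoder couldn’t resolve, you can filter for missing coordinates in the results. A minimal sketch, assuming the coord_main data frame built up by the loop above:

```r
library(dplyr)

# Rows where Nominatim returned no match have NA coordinates
failed <- coord_main |>
  filter(is.na(latitude) | is.na(longitude))

# Count and inspect the failures - candidates for further address cleaning
nrow(failed)
failed$storepoint_address
```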

The Fruits of Our (minimal) Labour

We’ve managed to pull publicly available address data and use a geocoder to get longitude and latitude coordinates; now it’s time to see what we’ve got. Through a combination of ggplot2 and ggiraph, we can quickly visualise the newly processed data points on an interactive map of Auckland. See the code below.

Code
library(sf)
library(ggplot2)
library(ggiraph)
#load our tennis club coordinates data
coords <- read.csv(here("inputs","tennis","tennis_coords.csv")) |>
  # Drop rows where geocoding failed to return coordinates
  filter(!is.na(latitude) & !is.na(longitude)) |>
  # Strip apostrophes, which can cause issues in the interactive tooltips
  mutate(name = str_replace_all(name, "'", ""))
#here's a shapefile for NZ that I prepared earlier
nz_map <- read_sf(here("inputs","shapefiles","NZ_res01.shp"))

g <- ggplot() +
  # Add our map of NZ
  geom_sf(
    data = nz_map,
    colour = "black",
    linewidth = 0.2
  ) +
    # Add the tennis club locations.
  #ggiraph makes things interactive.
  geom_point_interactive(
    data = coords,
    aes(
      x = longitude,
      y = latitude,
      tooltip = name,
      data_id = name
      ),
    hover_nearest = TRUE,
    shape = 21,
    size = 10,
    colour = "white",
    fill = "#049030",
    alpha = 0.7
    ) +
    # Clip the map just to the main Auckland Area
  coord_sf(
    xlim = c(174.5, 175.2), 
    ylim = c(-37.1, -36.6) 
  ) +
    #Style
  theme_void() +
  theme(title = element_text(size=20)) +
  #Title
  ggtitle("Tennis Courts in Auckland, New Zealand")

# wrap it in the girafe function which makes it interactive
girafe(
  ggobj = g,
  # set options for styling
    options = list(
    opts_hover(css = "fill:yellow;stroke:black;stroke-width:3px;")
    )
  )

That’s a wrap. Now that we have our data, there’s a huge number of analysis opportunities. We could look at underlying population demographics to understand access, combine it with drive-time data to understand potential catchment areas, or build the visualisation up to be fully customisable and interactive. At the very least, hopefully this post saves someone some time at the computer, manually copying and pasting addresses from a website into Google Maps!
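As a taste of that follow-on analysis, here’s a minimal sketch (assuming the coords data frame from above) that uses sf to compute each club’s straight-line distance from a point of interest — the Auckland CBD coordinates below are approximate and purely illustrative:

```r
library(sf)
library(dplyr)

# Approximate point of interest: the Auckland CBD (WGS84 lon/lat)
cbd <- st_sfc(st_point(c(174.7633, -36.8485)), crs = 4326)

# Convert the club coordinates into an sf object
coords_sf <- coords |>
  st_as_sf(coords = c("longitude", "latitude"), crs = 4326)

# st_distance on geographic coordinates returns great-circle distances in metres
coords_sf |>
  mutate(dist_km = as.numeric(st_distance(geometry, cbd)) / 1000) |>
  arrange(dist_km) |>
  head()
```

Swap the CBD point for any location (a home address, a school) to rank clubs by proximity; pairing this with a routing service would give the drive-time version.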