A world map of applicants to the BVBD course at UofI

Maps!
Portfolio
DataViz
Final
IHHE
Author

Lucas

Published

April 30, 2024

Preamble

Since 2021, I have served on the advisory committee as a student representative at the Institute for Health in the Human Ecosystem (IHHE). During this time, I have witnessed the growing popularity of the Biology of Vector-borne Diseases (BVBD) course and a significant increase in the number of applicants over the years. Thus, a logical idea emerged (for which I must credit Dr. Barrie Robison) to visualize these data on a world map.

Institute for Health in the Human Ecosystem (IHHE) logo.

Here’s a description of the Biology of Vector-borne Diseases course that’s hosted at the University of Idaho every year, taken from the IHHE webpage:

“The Institute for Health in the Human Ecosystem hosts the annual Biology of Vector-borne Diseases six-day course. This course provides accessible, condensed training and ‘knowledge networking’ for advanced graduate students, postdoctoral fellows, faculty and professionals to ensure competency in basic biology, current trends and developments, and practical knowledge for U.S. and global vector-borne diseases of plants, animals and humans. We seek to train the next generation of scientists and help working professionals to more effectively address current and emerging threats with holistic approaches and a strong network of collaborators and mentors.”

Here’s a link to the IHHE website:

https://www.uidaho.edu/research/entities/ihhe/education/vector-borne-diseases

As members of the organizing advisory committee, we have been considering how to examine the way the course’s applicant pool has grown and changed over the last six years. A suitable visualization is a world map displaying the applicants’ home countries, year by year.

Data

The data set contains two variables: Year and Country. Year is the year in which the course was offered, and Country is the country the applicant applied from, which ultimately serves as the spatial variable. The table below shows the data structure.

Code
library(readxl)
library(dplyr)
library(knitr)

participants <- readxl::read_excel("BVBD Applicants Merged.xlsx")
knitr::kable(head(participants))
Year  Country
2018  United States of America
2018  Nigeria
2018  United States of America
2018  United States of America
2018  United States of America
2018  United States of America
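
Before mapping, it can help to see how many applicants each country contributes in each year. The following is a minimal sketch on a small hypothetical data frame (`participants_demo` and its rows are made up for illustration; only the two column names, Year and Country, come from the real data).

```r
library(dplyr)

# Hypothetical stand-in for the applicant table (same two columns as the real file)
participants_demo <- data.frame(
  Year = c(2018, 2018, 2018, 2019, 2019),
  Country = c("United States of America", "Nigeria", "United States of America",
              "Brazil", "United States of America")
)

# One row per Year/Country pair with the number of applicants,
# sorted by year and then by descending count
applicants_by_year <- participants_demo %>%
  count(Year, Country, name = "Applicants") %>%
  arrange(Year, desc(Applicants))

print(applicants_by_year)
```

The same `count()` call should work on the real `participants` table, assuming its columns are named exactly Year and Country.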

Visualizations

First, let’s start with simple maps. Because the data contain only country names, with no geometry or coordinates, the names first had to be converted into coordinates (here, each country’s centroid) before they could be plotted on the map.

Code
library(rnaturalearth)
library(ggplot2)
library(dplyr)
library(sf)
library(readxl)

participants <- readxl::read_excel("BVBD Applicants Merged.xlsx")
world <- ne_countries(scale = "medium", returnclass = "sf")

# Compute each country's centroid once and keep its longitude and latitude
centroids <- st_coordinates(st_centroid(st_geometry(world)))
world$centroid_lon <- centroids[, 1]
world$centroid_lat <- centroids[, 2]

# Attach centroid coordinates to each applicant by matching country names;
# the geometry is dropped first so the join returns a plain data frame
participants_with_centroids <- participants %>%
  left_join(world %>% st_drop_geometry() %>% select(name, centroid_lon, centroid_lat),
            by = c("Country" = "name"))

ggplot(data = participants_with_centroids) +
  borders("world", colour = "gray", fill = "lightgray") + # Draw the base map (uses the maps package)
  geom_point(aes(x = centroid_lon, y = centroid_lat, color = as.factor(Year), shape = as.factor(Year)), size = 2) + # One point per applicant, colored and shaped by year
  scale_color_viridis_d(name = "Year") + # Discrete viridis palette, one color per year
  scale_shape_manual(name = "Year", values = c(16, 17, 18, 19, 15, 8)) + # One shape per year; supply as many values as there are years
  theme_minimal() +
  labs(title = "Applicants by Country and Year", x = "Longitude", y = "Latitude", color = "Year", shape = "Year") +
  theme(legend.position = "bottom")

Figure 1. Distribution of applicants by country and year.
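
Because the centroids are attached by matching country names, any applicant country spelled differently from the reference table ends up with missing coordinates and is left off the map. A quick check for such names could look like the sketch below. The two vectors are hypothetical stand-ins for `participants$Country` and `world$name`, not the real values.

```r
# Hypothetical applicant countries and reference names (stand-ins for
# participants$Country and world$name)
applicant_countries <- c("United States of America", "Nigeria", "USA", "Brasil")
reference_names <- c("United States of America", "Nigeria", "Brazil")

# Names present in the applicant data but absent from the reference table;
# these would receive NA coordinates after the join
unmatched <- setdiff(unique(applicant_countries), reference_names)
print(unmatched)
```

On the real data, the equivalent call would be `setdiff(unique(participants$Country), world$name)`; an empty result means every country was matched.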

Code
# Assuming 'participants' has a 'Country' column
# Count the number of applicants per country
applicants_summary <- participants %>%
  group_by(Country) %>%
  summarise(Applicants = n())

# Join with world data to include geometry
world_with_applicants <- world %>%
  left_join(applicants_summary, by = c("name" = "Country"))

# Plot with a gradient fill based on the number of applicants
ggplot(data = world_with_applicants) +
  geom_sf(aes(fill = Applicants), color = "white") +
  scale_fill_viridis_c(name = "Number of Applicants", na.value = "lightgrey", 
                       limits = c(1, 150), oob = scales::squish) +  # Set the range from 1 to 150
  labs(title = "Global Distribution of Applicants") +
  theme_minimal()

Figure 2. Global distribution of applicants.

Figure 1 uses two visual channels, color and shape, to differentiate between years. However, because every applicant from a given country is drawn at the same centroid, the points overlap (over-plotting); creating one map per year would reduce this problem. Figure 2 serves a different purpose: it displays the total number of applicants per country over the six years that the course has been offered.
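
One way to draw one map per year is to facet the plot by Year and collapse identical points into a single point sized by count. The sketch below uses a small hypothetical data frame (`demo`, with illustrative, approximate centroid values) in place of `participants_with_centroids`, and omits the base map to stay self-contained.

```r
library(dplyr)
library(ggplot2)

# Hypothetical applicants already joined to (approximate) country centroids
demo <- data.frame(
  Year = c(2018, 2018, 2018, 2019, 2019),
  Country = c("United States of America", "United States of America", "Nigeria",
              "United States of America", "Nigeria"),
  centroid_lon = c(-98.6, -98.6, 8.1, -98.6, 8.1),
  centroid_lat = c(39.8, 39.8, 9.6, 39.8, 9.6)
)

# Collapse overlapping points: one row per Year/Country with a count
points_per_year <- demo %>%
  count(Year, Country, centroid_lon, centroid_lat, name = "Applicants")

# One panel per year; point size encodes the number of applicants
p <- ggplot(points_per_year) +
  geom_point(aes(x = centroid_lon, y = centroid_lat, size = Applicants), alpha = 0.7) +
  facet_wrap(~Year) +
  theme_minimal() +
  labs(x = "Longitude", y = "Latitude", size = "Applicants")

p
```

With the real data, the `borders("world", ...)` layer from Figure 1 could be added before `geom_point()` so that each panel shows the base map.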

OK, let’s try an interactive map of the total number of applicants over the six years that the course has been offered. This map lets us zoom in and see each country’s name and number of applicants.

Code
library(dplyr)
library(sf)
library(rnaturalearth)
library(leaflet)
library(viridis)
library(readxl)

participants <- readxl::read_excel("BVBD Applicants Merged.xlsx")

# Count the number of applicants per country
applicants_summary <- participants %>%
  group_by(Country) %>%
  summarise(Applicants = n()) %>%
  ungroup()

# Get world countries' spatial data
world <- ne_countries(scale = "medium", returnclass = "sf")

# Merge applicants data with spatial data
world_with_applicants <- world %>%
  left_join(applicants_summary, by = c("name" = "Country"))

# Use the observed minimum and maximum counts as the domain of the color scale.
# Note: the scale is linear over this range, so if one country has far more
# applicants than the rest, small counts will be hard to tell apart.
min_applicants <- min(world_with_applicants$Applicants, na.rm = TRUE)
max_applicants <- max(world_with_applicants$Applicants, na.rm = TRUE)

# Define a color palette function based on the number of applicants
qpal <- colorNumeric(palette = "viridis", domain = c(min_applicants, max_applicants), na.color = "lightgrey")

# Create the leaflet map with enhanced highlight functionality
leaflet(world_with_applicants) %>%
  addProviderTiles(providers$CartoDB.Positron) %>%
  addPolygons(fillColor = ~qpal(Applicants),
              color = "#2b2b2b",  # Default outline color
              fillOpacity = 0.7,
              weight = 0.5,  # Default outline weight
              popup = ~paste(name, "<br>", "Applicants: ", Applicants),
              highlightOptions = highlightOptions(weight = 5,  # Thicker outline when highlighted
                                                  color = "#FFFF00",  # Bright yellow outline when highlighted
                                                  fillOpacity = 0.7,
                                                  bringToFront = TRUE)) %>%
  addLegend(pal = qpal, values = ~Applicants,
            title = "Number of Applicants",
            position = "bottomright") %>%
  setView(lat = 0, lng = 0, zoom = 2)

Figure 3. Interactive map of global distribution of applicants, with information per country.
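
If most countries have only a handful of applicants, a binned palette may separate the low counts more clearly than a continuous one. The sketch below uses leaflet’s `colorBin()` with hand-picked breaks; both the breaks and the `demo_counts` vector are assumptions chosen for illustration, not values taken from the applicant data.

```r
library(leaflet)

# Hypothetical per-country applicant counts
demo_counts <- c(1, 2, 5, 12, 150)

# Bins are closed on the left: [1,2), [2,6), [6,21), [21,Inf]
bin_pal <- colorBin(palette = "viridis", domain = demo_counts,
                    bins = c(1, 2, 6, 21, Inf), na.color = "lightgrey")

# Map each count to the color of its bin
demo_colors <- bin_pal(demo_counts)
print(demo_colors)
```

In the map code above, `bin_pal` could be passed in place of `qpal` in both `addPolygons(fillColor = ~bin_pal(Applicants))` and `addLegend(pal = bin_pal, ...)`.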

Let’s dive even further. Let’s try a 3D world map.

Code
library(plotly)
library(dplyr)
library(countrycode)

# Aggregating the number of applicants per country
applicants_summary <- participants %>%
  group_by(Country) %>%
  summarise(Applicants = n()) %>%
  ungroup()

# Adding ISO country codes to 'applicants_summary' for Plotly
applicants_summary$CODE <- countrycode(applicants_summary$Country, "country.name", "iso3c")

# Drop countries whose names could not be converted to an ISO code (CODE is NA)
applicants_summary <- na.omit(applicants_summary)

# Specify map projection and options
g <- list(
     showframe = FALSE,
     showcoastlines = FALSE,
     projection = list(type = 'orthographic'),
     resolution = 110,  # Plotly accepts 110 or 50 here
     showcountries = TRUE,
     countrycolor = '#d1d1d1',
     showocean = TRUE,
     oceancolor = '#c9d2e0',
     showlakes = TRUE,
     lakecolor = '#99c0db',
     showrivers = TRUE,
     rivercolor = '#99c0db')

# Plotting with simplified trace
p <- plot_geo(applicants_summary, locations = ~CODE, text = ~paste(Country, '<br>Applicants: ', Applicants),
              marker = list(size = ~sqrt(Applicants) * 2, color = ~Applicants, colorscale = 'Reds', line = list(color = "#d1d1d1", width = 0.5))) %>%
     colorbar(title = 'Number of Applicants') %>%
     layout(title = 'Number of Applicants Per Country', geo = g)

p

Figure 4. Interactive map (3D) of global distribution of applicants, per country.
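
Note that `na.omit()` silently removes any country whose name `countrycode()` could not convert, so those applicants disappear from the globe. A small check like the sketch below can list the affected names first. The `demo_countries` vector is hypothetical, with "Atlantis" standing in for a name that cannot be matched.

```r
library(countrycode)

# Hypothetical country names; "Atlantis" stands in for an unmatched name
demo_countries <- c("United States of America", "Nigeria", "Atlantis")

# warn = FALSE silences the package's own warning so failures can be reported explicitly
codes <- countrycode(demo_countries, "country.name", "iso3c", warn = FALSE)

# Names that produced no ISO code and would be removed by na.omit()
failed <- demo_countries[is.na(codes)]
print(failed)
```

Running the same check on `applicants_summary$Country` before calling `na.omit()` would show whether any applicants are being dropped.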

Conclusions

The right type of visualization depends on the purpose it is going to serve. For example, Figures 1 and 2 are best suited to a publication or poster, where interactivity is not possible. For a webpage, Figures 3 and 4 are more appealing and draw more attention from the viewer, who can examine specific countries and see the exact number of applicants on demand.