“BCB 520 - Midterm Portfolio Post”

“Midterm Assginments”

“Midterm 1, finding out how well/bad are we doing with grants”
Midtem
DataViz
Author

“Lucas”

Published

March 28, 2024

Preamble

The main goal of this midterm is to assess the University of Idaho’s performance regarding awarded grants and how it compares to other geographically close universities. The funding agencies are: The National Science Foundation, The National Institutes of Health, The Department of Energy, and The US Department of Agriculture.

We have been provided with several data sets from four different funding agencies. A quick overview of each funding agency is provided below:

1.- Department of Energy (DOE)

Code
library(tidyverse)
Warning: package 'readr' was built under R version 4.2.3
Warning: package 'dplyr' was built under R version 4.2.3
Warning: package 'stringr' was built under R version 4.2.3
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Code
library(readxl)
library(dplyr)
library(knitr)
try({
  DOE <- read_xlsx("DOEawards.xlsx")
  
  DOE_UI <- DOE %>% 
    dplyr::filter(Institution == 'Regents of the University of Idaho')
  
  DOE_UIFiltered <- DOE_UI %>%
    select(Title, Institution, PI, Status, `Program Office`, `Start Date`, `End Date`, `Amount Awarded to Date`)
  
  # Display the table
  knitr::kable(head(DOE_UIFiltered))
}, silent = TRUE)
New names:
• `` -> `...27`
Title Institution PI Status Program Office Start Date End Date Amount Awarded to Date
Nuclear Theory at the University of Idaho Regents of the University of Idaho Sammarruca, Francesca Active Office of Nuclear Physics 12/01/2021 11/30/2024 1812000
Converting methoxy groups on lignin-derived aromatics from a toxic hurdle to a useful resource: a systems-driven approach Regents of the University of Idaho Marx, Christopher Active Office of Biological & Environmental Research 09/01/2021 08/31/2024 1404162
Integrative Imaging of Plant Roots during Symbiosis with Mycorrhizal Fungi Regents of the University of Idaho Vasdekis, Andreas Active Office of Biological & Environmental Research 08/15/2021 08/14/2024 1519359
Nutrient and Fine Sediment Transport Driven by Perturbations in River Bed Movement Regents of the University of Idaho Yager, Elowyn Active Office of Biological & Environmental Research 09/01/2020 08/31/2024 603903

2.- The National Institutes of Health (NIH)

Code
library(tidyverse)
library(readxl)
library(dplyr)
library(knitr)

# Attempt to read the Excel file and process it
try({
  # Read the Excel file into a data frame
  NIH <- read_xlsx("NIH.xlsx")
  
  # Filter and select specific columns from the NSF data frame
  NIH_Filtered <- NIH %>%
    select(contact_pi_name, award_amount, budget_start)
  
  # Display the table using knitr::kable
  knitr::kable(head(NIH_Filtered))
}, silent = TRUE)
contact_pi_name award_amount budget_start
MCGUIRE, MICHELLE KAY 2354626 2024-03-11T12:03:00Z
WILLIAMS, JANET E. 461621 2024-01-15T12:01:00Z
CHEN, YIMIN 162320 2024-01-15T12:01:00Z
LANE, GINNY 153322 2024-01-15T12:01:00Z
MCGUIRE, MICHELLE KAY 300000 2024-01-15T12:01:00Z
MCGUIRE, MICHELLE KAY 946131 2024-01-15T12:01:00Z

3.- Department of Agriculture (NIFA)

Code
library(tidyverse)
library(readxl)
library(knitr)
USDA_UI  <- read.csv("USDAtoUI.csv")
knitr::kable(head(USDA_UI))
Award.Date Grant.Number Proposal.Number Grant.Title State.Name Grantee.Name Award.Dollars Program.Name Program.Area.Name
2010-09-30 2010-48679-01200 N/A N/A IDAHO SAES - UNIVERSITY OF IDAHO 7495 N/A N/A
2009-09-30 2009-48679-01200 N/A N/A IDAHO SAES - UNIVERSITY OF IDAHO 6813 N/A N/A
2008-09-30 2008-48679-01200 N/A N/A IDAHO SAES - UNIVERSITY OF IDAHO 8524 N/A N/A
2003-09-30 2003-48604-01200 N/A N/A IDAHO SAES - UNIVERSITY OF IDAHO 1097 N/A N/A
2010-09-30 2010-48024-01200 N/A N/A IDAHO SAES - UNIVERSITY OF IDAHO 11997 N/A N/A
2009-09-30 2009-48024-01200 N/A N/A IDAHO SAES - UNIVERSITY OF IDAHO 14990 N/A N/A

4.- The National Science Foundation (NSF)

Code
library(tidyverse)
library(readxl)
library(dplyr)
library(knitr)

# Attempt to read the Excel file and process it
try({
  # Read the Excel file into a data frame
  NSF <- read_xlsx("NSFtoUI.xlsx")
  
  # Filter and select specific columns from the NSF data frame
  NSF_Filtered <- NSF %>%
    select(pdPIName, startDate, expDate, estimatedTotalAmt)
  
  # Display the table using knitr::kable
  knitr::kable(head(NSF_Filtered))
}, silent = TRUE)
pdPIName startDate expDate estimatedTotalAmt
Dave Lien 04/01/2024 03/31/2026 628415.00
Kristopher V Waynant 04/01/2024 03/31/2027 456051.00
Tara Hudiburg 03/01/2024 08/31/2025 1000000.00
Julie M Amador 12/01/2023 11/30/2027 1179977.00
Esteban A Hernandez Vargas 11/01/2023 10/31/2025 250000.00
Lilian Alessa 10/01/2023 09/30/2027 2435509.00

Data Dictionary table

In general all the data sets shares the same variables with different names. The most commonly used ones are summarized in the table below:

Code
library(tidyverse)
library(readxl)
library(knitr)
datatable  <- read_xlsx("DataDictionary.xlsx")
knitr::kable(head(datatable))
Attribute Description Type
PI Principal Investigator Character
Institution University/instiution name Character
Start date Date where the award started Date
End date Date where the award ends Date
Code Internal code of the award Numeric
Amount Amount of money awarded Numeric

QUESTION 1:

Provide a visualization that shows our active awards from each sponsor. I need to see their start date and end date, the amount of the award, and the name of the Principal Investigator. I’m really interested in seeing how far into the future our current portfolio will exist. Are there a bunch of awards about to expire? Are there a bunch that just got funded and will be active for a while? Does this vary across sponsors?

Lets take a look at historical data:

Code
library(readxl)
library(dplyr)
library(ggplot2)

library(readxl)
library(dplyr)
library(ggplot2)

funding_data <- read_xlsx("UIawards3.xlsx")
New names:
• `` -> `...4`
• `` -> `...6`
Code
# Count the unique PIs per agency
pi_count_per_agency <- funding_data %>%
  group_by(Agency) %>%
  summarise(PI_Count = n_distinct(PI))

# Filter for the specific agencies if the dataset contains more agencies
pi_count_per_agency <- pi_count_per_agency %>%
  filter(Agency %in% c("NSF", "NIH", "DOE"))

# Create the bar chart
ggplot(pi_count_per_agency, aes(x = Agency, y = PI_Count, fill = Agency)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Total number of PIs Per Funding Agency in UofI since 1998",
       x = "Agency",
       y = "Number of PIs") +
  theme_minimal() +
  scale_fill_brewer(palette = "Set3")

Code
total_amount_per_agency <- funding_data %>%
  group_by(Agency) %>%
  summarise(TotalAmount = sum(Amount, na.rm = TRUE) / 1e6) # Convert to millions

# Filter for the specific agencies if the dataset contains more agencies
total_amount_per_agency <- total_amount_per_agency %>%
  filter(Agency %in% c("NSF", "NIH", "DOE"))

# Create the bar chart for total funding amount per agency in USD millions
ggplot(total_amount_per_agency, aes(x = Agency, y = TotalAmount, fill = Agency)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Total Funding Amount Per Funding Agency in UofI since 1998",
       x = "Agency",
       y = "Total Funding Amount (USD Millions)") +
  theme_minimal() +
  scale_fill_brewer(palette = "Set3")

Since 1998, NSF is the agency with the most PIs in UofI recent history. Looking at the total funding amount per agency, NIH and NSF are pretty close, this tell us that there are a group of PIs that gets constantly funded with large amounts in NIH. DOE is the area where UofI can grow the most, and more effort should be invested into.

Code
library(readxl)
library(dplyr)
library(ggplot2)
library(lubridate)

# Read data from an Excel file
USDA10 <- read_xlsx("USDA10UI.xlsx")
New names:
• `` -> `...4`
• `` -> `...6`
Code
# Convert start to Date objects
USDA10$start <- as.Date(USDA10$start, format = "%m/%d/%Y")

# Handle end date: Assuming you have only the year, create a Date object for December 31st of that year
USDA10$end <- as.Date(paste0(USDA10$end, "-12-31"), format = "%Y-%m-%d")

# Create the Gantt chart
ggplot(USDA10, aes(y = Agency, x = start, xend = end, yend = Agency)) +
  geom_segment(size = 10, color = "green") + # Adjust the size as needed
  scale_x_date(date_breaks = "1 year", date_labels = "%Y", limits = c(as.Date("2021-01-01"), as.Date("2029-01-01"))) +
  labs(title = "10 years of active USDA funding",
       x = "Timeline",
       y = "Agency") +
  theme_minimal() +
  theme(legend.position = "bottom")
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.
Warning: Removed 166 rows containing missing values (`geom_segment()`).

Code
library(readxl)
library(dplyr)
library(ggplot2)

# Read data from an Excel file
DOE10 <- read_xlsx("DOE10UI.xlsx")
New names:
• `` -> `...4`
• `` -> `...6`
Code
# Convert start and end to Date objects
# Assuming your Excel file has columns named 'start' and 'end'
DOE10$start <- as.Date(DOE10$start, format = "%m/%d/%Y")
DOE10$end <- as.Date(DOE10$end, format = "%m/%d/%Y")

# Create the Gantt chart
ggplot(DOE10, aes(y = Agency, x = start, xend = end, yend = Agency)) +
  geom_segment(size = 20, color = "black") + # Use linewidth instead of size
  scale_x_date(date_breaks = "1 year", date_labels = "%Y", limits = c(as.Date("2021-01-01"), as.Date("2029-01-01"))) +
  labs(title = "10 years of active DOE funding",
       x = "Timeline",
       y = "Agency") +
  theme_minimal() +
  theme(legend.position = "bottom")
Warning: Removed 1 rows containing missing values (`geom_segment()`).

Code
library(readxl)
library(dplyr)
library(ggplot2)
library(lubridate)
library(RColorBrewer)


# Aggregate data to get the total amount per PI and filter out amounts smaller than 200000
total_amount_per_PI <- DOE10 %>%
  group_by(PI) %>%
  summarise(TotalAmount = sum(Amount, na.rm = TRUE)) %>%
  filter(TotalAmount >= 200000) %>%  # Filter step added here
  arrange(desc(TotalAmount))

ggplot(total_amount_per_PI, aes(x = PI, y = TotalAmount / 1e6)) +
  geom_bar(stat = "identity") +
  labs(title = "DOE Total Funding per PI in the Last 10 Years",
       y = "Total Funding Amount (USD Millions)",
       x = "Principal Investigator") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))  # Rotate X-axis labels for readability

Code
library(readxl)
library(dplyr)
library(ggplot2)

# Read data from an Excel file
NSF10 <- read_xlsx("NSF10UI.xlsx")
New names:
• `` -> `...4`
• `` -> `...6`
Code
# Convert start and end to Date objects
# Assuming your Excel file has columns named 'start' and 'end'
NSF10$start <- as.Date(NSF10$start, format = "%m/%d/%Y")
NSF10$end <- as.Date(NSF10$end, format = "%m/%d/%Y")

# Create the Gantt chart
ggplot(NSF10, aes(y = Agency, x = start, xend = end, yend = Agency)) +
  geom_segment(size = 20, color = "red") + # Use linewidth instead of size
  scale_x_date(date_breaks = "1 year", date_labels = "%Y", limits = c(as.Date("2021-01-01"), as.Date("2029-01-01"))) +
  labs(title = "10 years of active NSF funding",
       x = "Timeline",
       y = "Agency") +
  theme_minimal() +
  theme(legend.position = "bottom")
Warning: Removed 81 rows containing missing values (`geom_segment()`).

Code
library(readxl)
library(dplyr)
library(ggplot2)
library(lubridate)
library(RColorBrewer)

NSF10 <- read_xlsx("NSF10UI.xlsx")
New names:
• `` -> `...4`
• `` -> `...6`
Code
# Aggregate data to get the total amount per PI and filter out amounts smaller than 200000
total_amount_per_PI <- NSF10 %>%
  group_by(PI) %>%
  summarise(TotalAmount = sum(Amount, na.rm = TRUE)) %>%
  filter(TotalAmount >= 200000) %>%  # Filter step added here
  arrange(desc(TotalAmount))

ggplot(total_amount_per_PI, aes(x = PI, y = TotalAmount / 1e6)) +
  geom_bar(stat = "identity") +
  labs(title = "NSF Total Funding per PI in the Last 10 Years",
       y = "Total Funding Amount (USD Millions)",
       x = "Principal Investigator") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))  # Rotate X-axis labels for readability

Code
library(ggplot2)
library(dplyr)
library(lubridate)

df <- read_xlsx("UIawards.xlsx")
New names:
• `` -> `...4`
• `` -> `...6`
Code
# Assuming your dataset is named df
# Clean 'Agency' data (remove leading/trailing spaces, check for case sensitivity)
df$Agency <- as.character(trimws(df$Agency))

# Convert 'start' and 'end' columns to Date format
df$start <- as.Date(df$start, format = "%Y-%m-%d")
df$end <- as.Date(df$end, format = "%Y-%m-%d")

# Create a timeline that spans 10 years back and 10 years forward from today
timeline_start <- Sys.Date() - years(10)
timeline_end <- Sys.Date() + years(10)

# Plotting
ggplot(df, aes(y = factor(Agency, levels = unique(Agency)), x = start, xend = end, yend = factor(Agency, levels = unique(Agency)))) +
  geom_segment(size = 20, color = "blue") +  # Adjust size and color as needed
  scale_x_date(limits = c(timeline_start, timeline_end), date_breaks = "1 year", date_labels = "%Y") +
  labs(title = "10 years of Active NIH  Timeline",
       x = "Year",
       y = "Agency") +
  theme_minimal() +
  theme(axis.text.y = element_text(size = 7),
        panel.grid.major.x = element_line(color = "grey80"),
        panel.grid.minor.x = element_blank(),
        panel.grid.major.y = element_blank(),
        panel.grid.minor.y = element_blank())
Warning: Removed 74 rows containing missing values (`geom_segment()`).

Code
library(readxl)
library(dplyr)
library(ggplot2)
library(lubridate)
library(RColorBrewer)

NIH10 <- read_xlsx("NIH10UI.xlsx")
New names:
• `` -> `...4`
• `` -> `...6`
Code
# Aggregate data to get the total amount per PI and filter out amounts smaller than 200000
total_amount_per_PI <- NIH10 %>%
  group_by(PI) %>%
  summarise(TotalAmount = sum(Amount, na.rm = TRUE)) %>%
  filter(TotalAmount >= 200000) %>%  # Filter step added here
  arrange(desc(TotalAmount))

ggplot(total_amount_per_PI, aes(x = PI, y = TotalAmount / 1e6)) +
  geom_bar(stat = "identity") +
  labs(title = "NIH Total Funding per PI in the Last 10 Years",
       y = "Total Funding Amount (USD Millions)",
       x = "Principal Investigator") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))  # Rotate X-axis labels for readability

According to the plotted data, NSF and NIH appear to have secured grants into the furthest future, (2028-2029). DOE has the largest grow potential, since just 4 PIs have grants.

QUESTION 3:

Code
library(readxl)
library(dplyr)
library(ggplot2)

BU <- read_xlsx("NIH.xlsx")

# Use 'BU' instead of 'df' for operations
selected_universities <- BU %>%
  filter(organization.org_name %in% c("BOISE STATE UNIVERSITY", "UNIVERSITY OF IDAHO")) %>%
  group_by(organization.org_name) %>%
  summarise(total_award_amount = sum(award_amount, na.rm = TRUE))

# Create the bar chart
ggplot(selected_universities, aes(x = organization.org_name, y = total_award_amount, fill = organization.org_name)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Total Award Amount by University NIH",
       x = "University",
       y = "Total Award Amount NIH") +
  theme_minimal() +
  scale_fill_manual(values = c("BOISE STATE UNIVERSITY" = "blue", "UNIVERSITY OF IDAHO" = "gold"))

Lets focus on USDA grants, since I’m in CALS, it makes sense to take a deeper look into USDA.

Code
# Load necessary libraries
library(readr)
library(dplyr)
library(ggplot2)
library(lubridate)

# Read the CSV file
data <- read_csv("USDAallU.csv")
Rows: 2314 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (8): Award Date, Grant Number, Proposal Number, Grant Title, State Name,...
dbl (1): Award Dollars

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Code
# Convert 'Award Date' to Date object using correct format and filter data from 2014 to 2024
data <- data %>%
  mutate(AwardDate = mdy(`Award Date`)) %>%
  filter(AwardDate >= as.Date("2014-01-01") & AwardDate <= as.Date("2024-12-31"))

# Aggregate data to get the total Award Dollars for each Grantee Name
total_award_per_grantee <- data %>%
  group_by(`Grantee Name`) %>%
  summarise(TotalAwardDollars = sum(`Award Dollars`, na.rm = TRUE)) %>%
  arrange(desc(TotalAwardDollars)) # Optional: arrange in descending order of total award dollars
# Assuming total_award_per_grantee is already created from previous steps

# Convert Grantee Name to a factor (if not already)
total_award_per_grantee$`Grantee Name` <- factor(total_award_per_grantee$`Grantee Name`)

# Assuming custom_colors is correctly mapped to your Grantee Names
custom_colors <- c("Boise State University" = "blue", 
                   "University of Idaho" = "gold", 
                   "Washington State University" = "red", 
                   "University of Montana" = "tan")


ggplot(total_award_per_grantee, aes(x = reorder(`Grantee Name`, TotalAwardDollars), y = TotalAwardDollars / 1e6, fill = `Grantee Name`)) +
  geom_bar(stat = "identity") +
  scale_fill_manual(values = c("Boise State University" = "blue", 
                   "University of Idaho" = "gold", 
                   "Washington State University" = "red", 
                   "University of Montana" = "tan")) +
  labs(title = "USDA Total Award Dollars per University (2014-2024)",
       x = "Univeristy",
       y = "Total Award Dollars (Millions USD)") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))  # For better readability of Grantee Names

Code
# Load necessary libraries
library(readr)
library(dplyr)
library(ggplot2)
library(lubridate)

# Read the data
data <- read_csv("USDAallU.csv")
Rows: 2314 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (8): Award Date, Grant Number, Proposal Number, Grant Title, State Name,...
dbl (1): Award Dollars

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Code
# Prepare the data by grouping and summarising
grants_per_grantee <- data %>%
  group_by(`Grantee Name`) %>%
  summarise(NumberOfGrants = n()) %>%
  arrange(desc(NumberOfGrants))  # Optional: arrange in descending order of number of grants

# Define custom colors for specific Grantee Names
custom_colors <- c("Boise State University" = "blue", 
                   "University of Idaho" = "gold", 
                   "Washington State University" = "red", 
                   "University of Montana" = "tan")

# Create the bar graph with custom colors
ggplot(grants_per_grantee, aes(x = reorder(`Grantee Name`, NumberOfGrants), y = NumberOfGrants, fill = `Grantee Name`)) +
  geom_bar(stat = "identity") +
  scale_fill_manual(values = custom_colors) +  # Apply custom colors
  labs(title = "Number of Grants per Grantee",
       x = "Grantee Name",
       y = "Number of Grants") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))  # Rotate X-axis labels for better readability

Conclusions

Let me start by saying that the 80/20 rule is completely accurate. I probably spent more than 80% of the time cleaning, merging, and generally preparing the appropriate data sets to produce these few data visualizations. There’s no convention among funding agencies on how they report their data; this would make things way easier for universities and institutions to have an idea of how they are doing in comparison to the rest.

DOE shows the largest growing potential, with just 4 PIs; there’s a big opportunity to grow.

UofI has a solid funding history with NIH with a few anchor PIs that got large total amounts.

In general, UofI is doing well, especially with USDA. Considering its size (around 12,000 students), UofI is comparable to Montana State (10,000 students), and the data shows it; numbers are similar for USDA. WSU (23,000 students) is significantly larger and it makes sense that it got more USDA projects. Boise State (23,000 students) is not an agricultural sciences-based university, but it is as large as WSU, so if they were to get into agricultural sciences, they would certainly be a threat to UofI’s USDA funding.

What I learned is that these types of analyses are way more complicated than they look. Choosing the right variables and presenting the data in a meaningful way is critical to arriving at the right conclusions. Also, since there’s a significant amount of data wrangling involved, the chances of making a mistake can’t be overlooked, so a quality assessment is mandatory to double-check for any possible mistakes.