“BCB 520 - Midterm Portfolio Post”

Preamble

The main goal of this midterm is to assess the University of Idaho’s performance regarding awarded grants and how it compares to other geographically close universities. The funding agencies are: The National Science Foundation, The National Institutes of Health, The Department of Energy, and The US Department of Agriculture.

We have been provided with several data sets from four different funding agencies. A quick overview of each funding agency is provided below:

1.- Department of Energy (DOE)

Code

library(tidyverse)

Warning: package 'readr' was built under R version 4.2.3

Warning: package 'dplyr' was built under R version 4.2.3

Warning: package 'stringr' was built under R version 4.2.3

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Code

library(readxl)
library(dplyr)
library(knitr)
try({
  DOE <- read_xlsx("DOEawards.xlsx")
  
  DOE_UI <- DOE %>% 
    dplyr::filter(Institution == 'Regents of the University of Idaho')
  
  DOE_UIFiltered <- DOE_UI %>%
    select(Title, Institution, PI, Status, `Program Office`, `Start Date`, `End Date`, `Amount Awarded to Date`)
  
  # Display the table
  knitr::kable(head(DOE_UIFiltered))
}, silent = TRUE)

New names:
• `` -> `...27`

Title	Institution	PI	Status	Program Office	Start Date	End Date	Amount Awarded to Date
Nuclear Theory at the University of Idaho	Regents of the University of Idaho	Sammarruca, Francesca	Active	Office of Nuclear Physics	12/01/2021	11/30/2024	1812000
Converting methoxy groups on lignin-derived aromatics from a toxic hurdle to a useful resource: a systems-driven approach	Regents of the University of Idaho	Marx, Christopher	Active	Office of Biological & Environmental Research	09/01/2021	08/31/2024	1404162
Integrative Imaging of Plant Roots during Symbiosis with Mycorrhizal Fungi	Regents of the University of Idaho	Vasdekis, Andreas	Active	Office of Biological & Environmental Research	08/15/2021	08/14/2024	1519359
Nutrient and Fine Sediment Transport Driven by Perturbations in River Bed Movement	Regents of the University of Idaho	Yager, Elowyn	Active	Office of Biological & Environmental Research	09/01/2020	08/31/2024	603903

2.- The National Institutes of Health (NIH)

Code

library(tidyverse)
library(readxl)
library(dplyr)
library(knitr)

# Attempt to read the Excel file and process it
try({
  # Read the Excel file into a data frame
  NIH <- read_xlsx("NIH.xlsx")
  
  # Filter and select specific columns from the NSF data frame
  NIH_Filtered <- NIH %>%
    select(contact_pi_name, award_amount, budget_start)
  
  # Display the table using knitr::kable
  knitr::kable(head(NIH_Filtered))
}, silent = TRUE)

contact_pi_name	award_amount	budget_start
MCGUIRE, MICHELLE KAY	2354626	2024-03-11T12:03:00Z
WILLIAMS, JANET E.	461621	2024-01-15T12:01:00Z
CHEN, YIMIN	162320	2024-01-15T12:01:00Z
LANE, GINNY	153322	2024-01-15T12:01:00Z
MCGUIRE, MICHELLE KAY	300000	2024-01-15T12:01:00Z
MCGUIRE, MICHELLE KAY	946131	2024-01-15T12:01:00Z

3.- Department of Agriculture (NIFA)

Code

library(tidyverse)
library(readxl)
library(knitr)
USDA_UI  <- read.csv("USDAtoUI.csv")
knitr::kable(head(USDA_UI))

Award.Date	Grant.Number	Proposal.Number	Grant.Title	State.Name	Grantee.Name	Award.Dollars	Program.Name	Program.Area.Name
2010-09-30	2010-48679-01200	N/A	N/A	IDAHO	SAES - UNIVERSITY OF IDAHO	7495	N/A	N/A
2009-09-30	2009-48679-01200	N/A	N/A	IDAHO	SAES - UNIVERSITY OF IDAHO	6813	N/A	N/A
2008-09-30	2008-48679-01200	N/A	N/A	IDAHO	SAES - UNIVERSITY OF IDAHO	8524	N/A	N/A
2003-09-30	2003-48604-01200	N/A	N/A	IDAHO	SAES - UNIVERSITY OF IDAHO	1097	N/A	N/A
2010-09-30	2010-48024-01200	N/A	N/A	IDAHO	SAES - UNIVERSITY OF IDAHO	11997	N/A	N/A
2009-09-30	2009-48024-01200	N/A	N/A	IDAHO	SAES - UNIVERSITY OF IDAHO	14990	N/A	N/A

4.- The National Science Foundation (NSF)

Code

library(tidyverse)
library(readxl)
library(dplyr)
library(knitr)

# Attempt to read the Excel file and process it
try({
  # Read the Excel file into a data frame
  NSF <- read_xlsx("NSFtoUI.xlsx")
  
  # Filter and select specific columns from the NSF data frame
  NSF_Filtered <- NSF %>%
    select(pdPIName, startDate, expDate, estimatedTotalAmt)
  
  # Display the table using knitr::kable
  knitr::kable(head(NSF_Filtered))
}, silent = TRUE)

pdPIName	startDate	expDate	estimatedTotalAmt
Dave Lien	04/01/2024	03/31/2026	628415.00
Kristopher V Waynant	04/01/2024	03/31/2027	456051.00
Tara Hudiburg	03/01/2024	08/31/2025	1000000.00
Julie M Amador	12/01/2023	11/30/2027	1179977.00
Esteban A Hernandez Vargas	11/01/2023	10/31/2025	250000.00
Lilian Alessa	10/01/2023	09/30/2027	2435509.00

Data Dictionary table

In general all the data sets shares the same variables with different names. The most commonly used ones are summarized in the table below:

Code

library(tidyverse)
library(readxl)
library(knitr)
datatable  <- read_xlsx("DataDictionary.xlsx")
knitr::kable(head(datatable))

Attribute	Description	Type
PI	Principal Investigator	Character
Institution	University/instiution name	Character
Start date	Date where the award started	Date
End date	Date where the award ends	Date
Code	Internal code of the award	Numeric
Amount	Amount of money awarded	Numeric

QUESTION 1:

Provide a visualization that shows our active awards from each sponsor. I need to see their start date and end date, the amount of the award, and the name of the Principal Investigator. I’m really interested in seeing how far into the future our current portfolio will exist. Are there a bunch of awards about to expire? Are there a bunch that just got funded and will be active for a while? Does this vary across sponsors?

Lets take a look at historical data:

Code

library(readxl)
library(dplyr)
library(ggplot2)

library(readxl)
library(dplyr)
library(ggplot2)

funding_data <- read_xlsx("UIawards3.xlsx")

New names:
• `` -> `...4`
• `` -> `...6`

Code

# Count the unique PIs per agency
pi_count_per_agency <- funding_data %>%
  group_by(Agency) %>%
  summarise(PI_Count = n_distinct(PI))

# Filter for the specific agencies if the dataset contains more agencies
pi_count_per_agency <- pi_count_per_agency %>%
  filter(Agency %in% c("NSF", "NIH", "DOE"))

# Create the bar chart
ggplot(pi_count_per_agency, aes(x = Agency, y = PI_Count, fill = Agency)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Total number of PIs Per Funding Agency in UofI since 1998",
       x = "Agency",
       y = "Number of PIs") +
  theme_minimal() +
  scale_fill_brewer(palette = "Set3")

Code

total_amount_per_agency <- funding_data %>%
  group_by(Agency) %>%
  summarise(TotalAmount = sum(Amount, na.rm = TRUE) / 1e6) # Convert to millions

# Filter for the specific agencies if the dataset contains more agencies
total_amount_per_agency <- total_amount_per_agency %>%
  filter(Agency %in% c("NSF", "NIH", "DOE"))

# Create the bar chart for total funding amount per agency in USD millions
ggplot(total_amount_per_agency, aes(x = Agency, y = TotalAmount, fill = Agency)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Total Funding Amount Per Funding Agency in UofI since 1998",
       x = "Agency",
       y = "Total Funding Amount (USD Millions)") +
  theme_minimal() +
  scale_fill_brewer(palette = "Set3")

Since 1998, NSF is the agency with the most PIs in UofI recent history. Looking at the total funding amount per agency, NIH and NSF are pretty close, this tell us that there are a group of PIs that gets constantly funded with large amounts in NIH. DOE is the area where UofI can grow the most, and more effort should be invested into.

Code

library(readxl)
library(dplyr)
library(ggplot2)
library(lubridate)

# Read data from an Excel file
USDA10 <- read_xlsx("USDA10UI.xlsx")

New names:
• `` -> `...4`
• `` -> `...6`

Code

# Convert start to Date objects
USDA10$start <- as.Date(USDA10$start, format = "%m/%d/%Y")

# Handle end date: Assuming you have only the year, create a Date object for December 31st of that year
USDA10$end <- as.Date(paste0(USDA10$end, "-12-31"), format = "%Y-%m-%d")

# Create the Gantt chart
ggplot(USDA10, aes(y = Agency, x = start, xend = end, yend = Agency)) +
  geom_segment(size = 10, color = "green") + # Adjust the size as needed
  scale_x_date(date_breaks = "1 year", date_labels = "%Y", limits = c(as.Date("2021-01-01"), as.Date("2029-01-01"))) +
  labs(title = "10 years of active USDA funding",
       x = "Timeline",
       y = "Agency") +
  theme_minimal() +
  theme(legend.position = "bottom")

Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.

Warning: Removed 166 rows containing missing values (`geom_segment()`).

Code

library(readxl)
library(dplyr)
library(ggplot2)

# Read data from an Excel file
DOE10 <- read_xlsx("DOE10UI.xlsx")

New names:
• `` -> `...4`
• `` -> `...6`

Code

# Convert start and end to Date objects
# Assuming your Excel file has columns named 'start' and 'end'
DOE10$start <- as.Date(DOE10$start, format = "%m/%d/%Y")
DOE10$end <- as.Date(DOE10$end, format = "%m/%d/%Y")

# Create the Gantt chart
ggplot(DOE10, aes(y = Agency, x = start, xend = end, yend = Agency)) +
  geom_segment(size = 20, color = "black") + # Use linewidth instead of size
  scale_x_date(date_breaks = "1 year", date_labels = "%Y", limits = c(as.Date("2021-01-01"), as.Date("2029-01-01"))) +
  labs(title = "10 years of active DOE funding",
       x = "Timeline",
       y = "Agency") +
  theme_minimal() +
  theme(legend.position = "bottom")

Warning: Removed 1 rows containing missing values (`geom_segment()`).

Code

library(readxl)
library(dplyr)
library(ggplot2)
library(lubridate)
library(RColorBrewer)


# Aggregate data to get the total amount per PI and filter out amounts smaller than 200000
total_amount_per_PI <- DOE10 %>%
  group_by(PI) %>%
  summarise(TotalAmount = sum(Amount, na.rm = TRUE)) %>%
  filter(TotalAmount >= 200000) %>%  # Filter step added here
  arrange(desc(TotalAmount))

ggplot(total_amount_per_PI, aes(x = PI, y = TotalAmount / 1e6)) +
  geom_bar(stat = "identity") +
  labs(title = "DOE Total Funding per PI in the Last 10 Years",
       y = "Total Funding Amount (USD Millions)",
       x = "Principal Investigator") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))  # Rotate X-axis labels for readability

Code

library(readxl)
library(dplyr)
library(ggplot2)

# Read data from an Excel file
NSF10 <- read_xlsx("NSF10UI.xlsx")

New names:
• `` -> `...4`
• `` -> `...6`

Code

# Convert start and end to Date objects
# Assuming your Excel file has columns named 'start' and 'end'
NSF10$start <- as.Date(NSF10$start, format = "%m/%d/%Y")
NSF10$end <- as.Date(NSF10$end, format = "%m/%d/%Y")

# Create the Gantt chart
ggplot(NSF10, aes(y = Agency, x = start, xend = end, yend = Agency)) +
  geom_segment(size = 20, color = "red") + # Use linewidth instead of size
  scale_x_date(date_breaks = "1 year", date_labels = "%Y", limits = c(as.Date("2021-01-01"), as.Date("2029-01-01"))) +
  labs(title = "10 years of active NSF funding",
       x = "Timeline",
       y = "Agency") +
  theme_minimal() +
  theme(legend.position = "bottom")

Warning: Removed 81 rows containing missing values (`geom_segment()`).

Code

library(readxl)
library(dplyr)
library(ggplot2)
library(lubridate)
library(RColorBrewer)

NSF10 <- read_xlsx("NSF10UI.xlsx")

New names:
• `` -> `...4`
• `` -> `...6`

Code

# Aggregate data to get the total amount per PI and filter out amounts smaller than 200000
total_amount_per_PI <- NSF10 %>%
  group_by(PI) %>%
  summarise(TotalAmount = sum(Amount, na.rm = TRUE)) %>%
  filter(TotalAmount >= 200000) %>%  # Filter step added here
  arrange(desc(TotalAmount))

ggplot(total_amount_per_PI, aes(x = PI, y = TotalAmount / 1e6)) +
  geom_bar(stat = "identity") +
  labs(title = "NSF Total Funding per PI in the Last 10 Years",
       y = "Total Funding Amount (USD Millions)",
       x = "Principal Investigator") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))  # Rotate X-axis labels for readability

Code

library(ggplot2)
library(dplyr)
library(lubridate)

df <- read_xlsx("UIawards.xlsx")

New names:
• `` -> `...4`
• `` -> `...6`

Code

# Assuming your dataset is named df
# Clean 'Agency' data (remove leading/trailing spaces, check for case sensitivity)
df$Agency <- as.character(trimws(df$Agency))

# Convert 'start' and 'end' columns to Date format
df$start <- as.Date(df$start, format = "%Y-%m-%d")
df$end <- as.Date(df$end, format = "%Y-%m-%d")

# Create a timeline that spans 10 years back and 10 years forward from today
timeline_start <- Sys.Date() - years(10)
timeline_end <- Sys.Date() + years(10)

# Plotting
ggplot(df, aes(y = factor(Agency, levels = unique(Agency)), x = start, xend = end, yend = factor(Agency, levels = unique(Agency)))) +
  geom_segment(size = 20, color = "blue") +  # Adjust size and color as needed
  scale_x_date(limits = c(timeline_start, timeline_end), date_breaks = "1 year", date_labels = "%Y") +
  labs(title = "10 years of Active NIH  Timeline",
       x = "Year",
       y = "Agency") +
  theme_minimal() +
  theme(axis.text.y = element_text(size = 7),
        panel.grid.major.x = element_line(color = "grey80"),
        panel.grid.minor.x = element_blank(),
        panel.grid.major.y = element_blank(),
        panel.grid.minor.y = element_blank())

Warning: Removed 74 rows containing missing values (`geom_segment()`).

Code

library(readxl)
library(dplyr)
library(ggplot2)
library(lubridate)
library(RColorBrewer)

NIH10 <- read_xlsx("NIH10UI.xlsx")

New names:
• `` -> `...4`
• `` -> `...6`

Code

# Aggregate data to get the total amount per PI and filter out amounts smaller than 200000
total_amount_per_PI <- NIH10 %>%
  group_by(PI) %>%
  summarise(TotalAmount = sum(Amount, na.rm = TRUE)) %>%
  filter(TotalAmount >= 200000) %>%  # Filter step added here
  arrange(desc(TotalAmount))

ggplot(total_amount_per_PI, aes(x = PI, y = TotalAmount / 1e6)) +
  geom_bar(stat = "identity") +
  labs(title = "NIH Total Funding per PI in the Last 10 Years",
       y = "Total Funding Amount (USD Millions)",
       x = "Principal Investigator") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))  # Rotate X-axis labels for readability

According to the plotted data, NSF and NIH appear to have secured grants into the furthest future, (2028-2029). DOE has the largest grow potential, since just 4 PIs have grants.

QUESTION 3:

Code

library(readxl)
library(dplyr)
library(ggplot2)

BU <- read_xlsx("NIH.xlsx")

# Use 'BU' instead of 'df' for operations
selected_universities <- BU %>%
  filter(organization.org_name %in% c("BOISE STATE UNIVERSITY", "UNIVERSITY OF IDAHO")) %>%
  group_by(organization.org_name) %>%
  summarise(total_award_amount = sum(award_amount, na.rm = TRUE))

# Create the bar chart
ggplot(selected_universities, aes(x = organization.org_name, y = total_award_amount, fill = organization.org_name)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Total Award Amount by University NIH",
       x = "University",
       y = "Total Award Amount NIH") +
  theme_minimal() +
  scale_fill_manual(values = c("BOISE STATE UNIVERSITY" = "blue", "UNIVERSITY OF IDAHO" = "gold"))

Lets focus on USDA grants, since I’m in CALS, it makes sense to take a deeper look into USDA.

Code

# Load necessary libraries
library(readr)
library(dplyr)
library(ggplot2)
library(lubridate)

# Read the CSV file
data <- read_csv("USDAallU.csv")

Rows: 2314 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (8): Award Date, Grant Number, Proposal Number, Grant Title, State Name,...
dbl (1): Award Dollars

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Code

# Convert 'Award Date' to Date object using correct format and filter data from 2014 to 2024
data <- data %>%
  mutate(AwardDate = mdy(`Award Date`)) %>%
  filter(AwardDate >= as.Date("2014-01-01") & AwardDate <= as.Date("2024-12-31"))

# Aggregate data to get the total Award Dollars for each Grantee Name
total_award_per_grantee <- data %>%
  group_by(`Grantee Name`) %>%
  summarise(TotalAwardDollars = sum(`Award Dollars`, na.rm = TRUE)) %>%
  arrange(desc(TotalAwardDollars)) # Optional: arrange in descending order of total award dollars
# Assuming total_award_per_grantee is already created from previous steps

# Convert Grantee Name to a factor (if not already)
total_award_per_grantee$`Grantee Name` <- factor(total_award_per_grantee$`Grantee Name`)

# Assuming custom_colors is correctly mapped to your Grantee Names
custom_colors <- c("Boise State University" = "blue", 
                   "University of Idaho" = "gold", 
                   "Washington State University" = "red", 
                   "University of Montana" = "tan")


ggplot(total_award_per_grantee, aes(x = reorder(`Grantee Name`, TotalAwardDollars), y = TotalAwardDollars / 1e6, fill = `Grantee Name`)) +
  geom_bar(stat = "identity") +
  scale_fill_manual(values = c("Boise State University" = "blue", 
                   "University of Idaho" = "gold", 
                   "Washington State University" = "red", 
                   "University of Montana" = "tan")) +
  labs(title = "USDA Total Award Dollars per University (2014-2024)",
       x = "Univeristy",
       y = "Total Award Dollars (Millions USD)") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))  # For better readability of Grantee Names

Code

# Load necessary libraries
library(readr)
library(dplyr)
library(ggplot2)
library(lubridate)

# Read the data
data <- read_csv("USDAallU.csv")

Rows: 2314 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (8): Award Date, Grant Number, Proposal Number, Grant Title, State Name,...
dbl (1): Award Dollars

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Code

# Prepare the data by grouping and summarising
grants_per_grantee <- data %>%
  group_by(`Grantee Name`) %>%
  summarise(NumberOfGrants = n()) %>%
  arrange(desc(NumberOfGrants))  # Optional: arrange in descending order of number of grants

# Define custom colors for specific Grantee Names
custom_colors <- c("Boise State University" = "blue", 
                   "University of Idaho" = "gold", 
                   "Washington State University" = "red", 
                   "University of Montana" = "tan")

# Create the bar graph with custom colors
ggplot(grants_per_grantee, aes(x = reorder(`Grantee Name`, NumberOfGrants), y = NumberOfGrants, fill = `Grantee Name`)) +
  geom_bar(stat = "identity") +
  scale_fill_manual(values = custom_colors) +  # Apply custom colors
  labs(title = "Number of Grants per Grantee",
       x = "Grantee Name",
       y = "Number of Grants") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))  # Rotate X-axis labels for better readability

Conclusions

Let me start by saying that the 80/20 rule is completely accurate. I probably spent more than 80% of the time cleaning, merging, and generally preparing the appropriate data sets to produce these few data visualizations. There’s no convention among funding agencies on how they report their data; this would make things way easier for universities and institutions to have an idea of how they are doing in comparison to the rest.

DOE shows the largest growing potential, with just 4 PIs; there’s a big opportunity to grow.

UofI has a solid funding history with NIH with a few anchor PIs that got large total amounts.

In general, UofI is doing well, especially with USDA. Considering its size (around 12,000 students), UofI is comparable to Montana State (10,000 students), and the data shows it; numbers are similar for USDA. WSU (23,000 students) is significantly larger and it makes sense that it got more USDA projects. Boise State (23,000 students) is not an agricultural sciences-based university, but it is as large as WSU, so if they were to get into agricultural sciences, they would certainly be a threat to UofI’s USDA funding.

What I learned is that these types of analyses are way more complicated than they look. Choosing the right variables and presenting the data in a meaningful way is critical to arriving at the right conclusions. Also, since there’s a significant amount of data wrangling involved, the chances of making a mistake can’t be overlooked, so a quality assessment is mandatory to double-check for any possible mistakes.