1. Executive Summary

MILRD offers Virtual Training Projects (VTPs) in advanced bioinformatics and other data science techniques for professionals, trainees, and students.

In January 2022, MILRD completed a year-long initiative, funded by the Illumina Corporate Foundation and individual donors, that supported 170 participants in MILRD’s VTPs: aspiring scientists from underrepresented backgrounds at the high school and college levels, and teachers from high-need schools. We are now seeking follow-on funding.

To build on these efforts, MILRD is seeking $157,500 in funding ($250 per participant * 630 participants) to integrate our Metagenomics + Microbial Surveillance VTP into courses at high-need colleges and high schools and train 600 students and 30 instructors in bioinformatics. We will also test whether VTP Subsets (short, standalone VTP components completed before optionally embarking on the full project) increase student engagement and make course integration easier for teachers.

Overall, the goal of this project is to use student and teacher feedback to more effectively integrate and scale our VTPs moving forward—with an emphasis on accommodating under-resourced schools.

We have secured course integrations for April, May, and June 2022 and for the Fall Semester, pending philanthropic funding to support them.

2. Program Overview

MILRD is seeking funds to integrate our Metagenomics + Microbial Surveillance VTP into courses at high-need high schools and colleges to enroll 600 students and 30 instructors.

So far, we have secured these course integrations—comprising 500 students and 25 instructors—pending successful fundraising:

We are in discussions with additional schools and are confident we will secure course integrations comprising at least 100 more students and 5 more instructors.

a. Collaborating Organizations

MindsOf and Steppingstone Scholars are coordinating course integrations with schools in their networks.

MindsOf Initiative is a mentorship program that supports aspiring professionals of Caribbean heritage across a broad range of fields. This initiative provides career guidance and training opportunities to high schoolers, undergraduates, and young professionals. MindsOf and MILRD collaborate to provide cost-free VTPs to underrepresented minority students.

Steppingstone Scholars is an educational social mobility non-profit organization. For low-income students in the City of Philadelphia, there are often no clear pathways to college or the workforce. Since 1999, Steppingstone has been working to address this systemic problem by creating not just one pathway, but many. Steppingstone Ventures programming reaches over 1,000 students per year, preparing them for their futures while serving as an innovation hub for better ways to support students, with a focus on STEM enrichment and university partnerships.

c. VTP Structure

VTPs are 1-2 week training projects that teach bioinformatic pipelines for analyzing data in a specific subdomain, e.g. Metagenomics. These projects are generally based on reproducing the results of peer-reviewed scientific literature, starting with raw data and working through to data visualization and inference. When integrated into a course, VTPs bring bioinformatic and data science competencies into the classroom while leveraging the scalability of MILRD’s platform.

Course integration VTPs include:

  • Unlimited support from expert mentors (group channel + video conference calls)
  • Access to all required high-performance compute resources (AWS), analysis tools and software
  • Access to all source data required to complete the project
  • Pre-VTP Preparation in the programming languages and environments students will use to complete assigned bioinformatics tasks (optional)
  • Post-VTP assistance to help students download/install datasets/software used in the VTPs on their local computers so they can continue analyses offline (optional)

Working with the Dundee Central School, SUNY Binghamton, and STANYS Teacher Professional Development pilots, we have optimized our Metagenomics + Microbial Surveillance VTP to require a total of 6-8 hours of class time.

d. Key Staff and Mentors

Key Staff:

VTP Mentors:

3. Timeline

We have secured course integration opportunities for 7 courses across 6 schools over the next 10 months, comprising 25 instructors and 500 students (see table in Section 2).

Instructor Professional Development (PD) training will take place approximately 5-10 days before each course integration. All instructors have committed to completing this component of the project.

4. Goals & Deliverables

Aim 1 Train instructors to assist students with VTPs using accelerated professional development (PD) sessions

In order to effectively scale course-integrated VTPs in the long term, MILRD needs to enable teachers to quickly develop familiarity with VTP contents and how those contents overlap with their course goals. Towards this goal, we will use accelerated PD sessions to train teachers whose courses we are integrating and use their feedback to iterate our PD structure.

In this project we will train 25 instructors (8 high school teachers, 3 professors, and 14 teaching assistants) prior to working with each instructor’s students.

We will evaluate pre/post efficacy feedback from teachers: (1) before and after the teacher PD and (2) before and after their students complete the course-integrated VTP.

MILRD has experience working with high school teachers, which provides a starting point for this aim. To date, we have taken 10 teachers through complete VTPs, and one of these teachers has integrated a VTP into their course. In November 2021, MILRD presented a VTP workshop, and was a non-profit exhibitor, at the annual conference of the Science Teachers Association of New York State (STANYS), New York’s oldest professional organization of Pre-K to University public-school science educators. The STANYS conference:

  1. Provided an opportunity to develop an accelerated version of the Metagenomics + Microbial Surveillance VTP that could be completed in 1 hour for the workshop.

  2. Led us to partner with the STANYS organization to provide a teacher PD event for Continuing Teacher and Learner Education (CTLE) credit (8 hours CTLE credit, Feb 1-10, 2022) with 9 teachers.

  3. Introduced our organization to many teachers and schools throughout New York State, many of which have agreed to, or expressed interest in, course integration pilots.

We now have the requisite expertise to take teachers through an accelerated training of this VTP prior to their students participating.

Aim 2 Integrate VTPs into courses at high-need high schools & colleges

While we have seen VTPs benefit all types of students, it is especially important to enroll those who are socioeconomically disadvantaged and/or from under-represented minority groups. Compared with their more affluent and better-represented peers, these students face a disproportionate gap in knowledge about college and career readiness, potential career tracks, and the skills needed to excel in them. We want to grow a platform and community for everyone, which means our AI-enabled platform cannot learn and evolve only from affluent students.

Towards this goal, we will take 600 students from high-need colleges and high schools through the VTP. We will evaluate pre/post efficacy feedback from students: before and after course-integrated VTPs.

We have completed pilot course integrations at high-need schools at both the high school and college levels. The STANYS conference helped us quickly secure and complete a pilot integration at Dundee Central School in Jenn Clancy’s Environmental Science course, with 13 students.

In parallel, we integrated the same VTP into Dr. Peter McKenney’s SUNY Binghamton BIOL 425 Molecular Biology Laboratory Course with 91 students and six Teaching Assistants.

We are now in a position to scale and iterate our VTP platform with larger numbers of high school and college students.

Aim 3 Test and iterate VTP Subsets

VTP Subsets are smaller (0.5 - 1.5 hours total effort), standalone, bioinformatics/data science VTP elements that students/teachers can complete with the option to proceed with the full VTP project if desired. We have preliminary indications that VTP Subsets are a good way to actively engage students at the beginning of a VTP and are a good option for teachers who do not have the class time to complete a full VTP.

We want to test how valuable VTP Subsets are for students and teachers. Towards this goal, we will structure the VTP integrations so that participants start with a Subset and then complete the full VTP.

The pilots with Dundee Central School, SUNY Binghamton, and the STANYS Teacher PD event provided critical data on how to adapt our VTPs into Subsets for high school and college classrooms. For example, we realized that asking students to use the ssh protocol through a program like PuTTY at the start of a VTP can lead to many laptop-specific installation issues and discourage students at the outset. Students therefore now access their VTPs through a browser-based ssh client so they can hit the ground running; we teach them how to ssh manually later in the VTP.

We also discovered (1) the need for an earlier “hook” in the VTP to encourage students; and (2) that some teachers want to integrate VTPs into their biology courses but cannot fit the 1-2 week commitment of a full VTP into the courses they teach. To address this feedback, we conceptualized and developed VTP Subsets.

Linux Steps R Steps Subset Component

Each of these schools has committed to integrating the full VTP, but we would have them start with the VTP Subset, which shows them a result quickly, and then go back upstream to the raw data and proceed downstream from there to complete the full analysis.

5. Estimated Impact

We will train 600 students and 30 instructors in this project.

We will assess the impact of this project on students and instructors by evaluating whether the VTP increased knowledge of genomics data format/structure, metagenomics sample processing & analysis, Linux/Bash terminal use, and applications of bioinformatics tools.

We have previously assessed the impact of this VTP on a smaller cohort (20 students; see Example Impact Assessment in Section 7, Evaluation Measures) and observed very large effect sizes in each assessed knowledge category. We will investigate whether we can replicate these promising results in a much larger cohort.

6. Itemized Budget

$250 per participant * 630 participants = $157,500

Direct Costs: $102,375

  • Mentors: $47,250

  • High Performance compute instances (AWS): $25,200

  • HPC Sys Admin support + Software Engineering Support: $14,925

  • Subcontract to MindsOf: $10,000

  • Platform utilities (e.g. Slack, Zoom, GSuite): $5,000

Indirect Costs (35% of total request): $55,125
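The budget arithmetic above can be checked with a short R sketch. The figures are copied from this section; the variable names are illustrative, and applying the 35% indirect rate to the total request (rather than to direct costs) is an assumption inferred from the stated totals.

```r
# Budget figures copied from the itemized budget above (US dollars)
participants <- 630
cost_per_participant <- 250
total_request <- participants * cost_per_participant

direct_items <- c(
  mentors = 47250,
  aws_compute = 25200,
  sysadmin_and_engineering = 14925,
  mindsof_subcontract = 10000,
  platform_utilities = 5000
)
direct_costs <- sum(direct_items)

# Assumption: the 35% indirect rate is applied to the total request
indirect_costs <- 0.35 * total_request

# The pieces should reproduce the stated totals
stopifnot(total_request == 157500)
stopifnot(direct_costs == 102375)
stopifnot(isTRUE(all.equal(indirect_costs, 55125)))
stopifnot(isTRUE(all.equal(direct_costs + indirect_costs, total_request)))

# Share of the total request going to each direct line item, in percent
round(100 * direct_items / total_request, 1)
```

Under this reading, direct and indirect costs sum exactly to the $157,500 request.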

7. Evaluation Measures

To understand the impact of our instructor PD and course-integrated VTPs, we will use pre/post self-efficacy studies of teachers and students to assess whether there is increased knowledge of: genomics data format/structure, metagenomics sample processing & analysis, Linux/Bash terminal use, and applications of bioinformatics tools.

See “Example Impact Assessment” below from a previous project funded by the Illumina Corporate Foundation. We will use a similar efficacy assessment design in this project.

Example Impact Assessment

In November 2020, MILRD and partner organization MindsOf Initiative were awarded a grant from the Illumina Corporate Foundation to enroll 50 aspiring scientists at the high school and college levels from underrepresented backgrounds in MILRD’s Virtual Training Projects (VTPs). MILRD and MindsOf completed the project in January 2022. All told, the Illumina Corporate Foundation funds fully supported 55 scholarship students.

To understand the impact of our VTPs, we conducted a pre/post self-efficacy study with 20 students from the MILRD/MindsOf project who completed the Metagenomics + Microbial Surveillance VTP between January 1, 2022 and January 31, 2022. We selected this VTP because the greatest number of students completed it.

We plan to use a similar impact assessment design for both student and instructor participants.

Methods

This study assessed the Metagenomics + Microbial Surveillance VTP as an intervention to increase knowledge of: genomics data format/structure, metagenomics sample processing & analysis, Linux/Bash terminal use, R/RStudio use, and applications of bioinformatics tools.

The study design is a within-subjects (pre-post) design where we assess relevant dependent measures immediately before and immediately after workshop participation. There is no control group.

Of the 20 MILRD/MindsOf students in this study, 17 identified as ‘Black or African American’ and 3 identified as ‘Asian or Pacific Islander’. No students indicated they were of Hispanic or Latino descent.

The group was not large and varied enough to assess gender and education level as covariates:

*Participants are presented with a blank text box when asked to report their gender.

Students were asked to rate their knowledge level for each of these statements on a whole-number scale from 0 (None) to 6 (Expert):

  • I understand how genomics data are structured/formatted
  • I understand how metagenomics data are collected and processed
  • I understand how metagenomics data are analyzed
  • I understand how to use the Linux/Bash terminal
  • I understand how to use R/RStudio
  • I understand how bioinformatics tools can be used to answer a scientific question

Independent Variable: Time of Assessment (pre, post)

Dependent Measures: knowledge of (1) genomics data format/structure, (2) metagenomics sample processing, (3) metagenomics analysis, (4) Linux/Bash terminal use, (5) R/RStudio use, and (6) applications of bioinformatics tools.
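To make the within-subjects design concrete, here is a minimal base-R sketch of a paired pre/post comparison for a single dependent measure. The ratings are made up for illustration and are not study data. The variable names, and the choice of the paired effect size (mean difference divided by the SD of the differences), are assumptions rather than a description of the analysis reported below.

```r
# Hypothetical ratings for 8 students (illustrative only, not study data);
# element i of `pre` and `post` belongs to the same student
pre  <- c(0, 1, 0, 2, 0, 1, 0, 1)
post <- c(3, 4, 2, 5, 3, 4, 3, 3)

diffs <- post - pre
n <- length(diffs)

# Paired-samples t statistic: mean difference divided by its standard error
t_manual <- mean(diffs) / (sd(diffs) / sqrt(n))
df_manual <- n - 1

# Base R equivalent; paired = TRUE is required because t.test() defaults to
# a Welch two-sample test
fit <- t.test(post, pre, paired = TRUE)

# Effect size for paired data: mean difference over the SD of the differences
d_paired <- mean(diffs) / sd(diffs)

# The manual calculation and t.test() should agree
stopifnot(isTRUE(all.equal(unname(fit$statistic), t_manual)))
stopifnot(unname(fit$parameter) == df_manual)
```

In a paired test the degrees of freedom equal the number of students minus one, since the test is run on the per-student differences.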

Results

Overall, students reported substantial increases in knowledge across all assessed categories following VTP completion: (a) genomics data format knowledge (Cohen’s d = 2.67), (b) metagenomics data collection/processing knowledge (Cohen’s d = 4.15), (c) metagenomics analysis knowledge (Cohen’s d = 3.28), (d) Linux/Bash terminal knowledge (Cohen’s d = 2.37), (e) R/RStudio knowledge (Cohen’s d = 3.29), (f) bioinformatics application knowledge (Cohen’s d = 2.21).

Please note: because students answered the survey with whole numbers, some of the lines connecting pre/post responses lie on top of each other, so 20 distinct lines are not always visible.

a. Genomics data format

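# ggpubr provides ggpaired(); effsize provides cohen.d()
library(ggpubr)
library(effsize)

# Pre- and post-VTP survey responses; ggpaired() below pairs rows by position,
# so both files must list students in the same order
pre <- read.csv('pre_metagenomics_for_csv.csv', header = TRUE, sep = ',')
post <- read.csv('post_metagenomics_for_csv.csv', header = TRUE, sep = ',')

# Survey item: I understand how genomics data are structured/formatted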

pre_genomics_data <- pre$I.understand.how.genomics.data.are.structured_formatted
post_genomics_data <- post$I.understand.how.genomics.data.are.structured_formatted

summary(pre_genomics_data)
summary(post_genomics_data)

mean(post_genomics_data) - mean(pre_genomics_data)

sd(pre_genomics_data)
sd(post_genomics_data)

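# As called here (no paired = TRUE), t.test() runs R's default Welch two-sample test
t.test(post_genomics_data, pre_genomics_data)

# As called here, cohen.d() uses the pooled SD of the two groups (paired = FALSE)
cohen.d(post_genomics_data, pre_genomics_data)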

genomics_data <- data.frame(Pre_VTP = pre_genomics_data, Post_VTP = post_genomics_data)
ggpaired(genomics_data, cond1 = "Pre_VTP", cond2 = "Post_VTP",
         color = "condition", fill = NULL, line.color = "black", palette = 'ucscgb', title = "I understand how genomics data are structured/formatted", xlab = " ", ylab = "Student Response")

A paired-samples t-test showed that, as hypothesized, participants reported lower levels of genomics data format knowledge before completing the intervention (M = 0.550, SD = 0.999) than after (M = 3.55, SD = 1.23), Mdiff = 3.0, t(20) = 8.45, p < .0001. The effect size is very large (Cohen’s d = 2.67).

b. Metagenomics data collection

#I understand how metagenomics data are collected and processed

pre_metagenomics_collected <- pre$I.understand.how.metagenomics.data.are.collected.and.processed
post_metagenomics_collected <- post$I.understand.how.metagenomics.data.are.collected.and.processed

summary(pre_metagenomics_collected)
summary(post_metagenomics_collected)

mean(post_metagenomics_collected) - mean(pre_metagenomics_collected)

sd(pre_metagenomics_collected)
sd(post_metagenomics_collected)

t.test(post_metagenomics_collected, pre_metagenomics_collected)

cohen.d(post_metagenomics_collected, pre_metagenomics_collected)

metagenomics_collected <- data.frame(Pre_VTP = pre_metagenomics_collected, Post_VTP = post_metagenomics_collected)
ggpaired(metagenomics_collected, cond1 = "Pre_VTP", cond2 = "Post_VTP",
         color = "condition", fill = NULL, line.color = "black", palette = 'ucscgb', title = "I understand how metagenomics data are collected and processed", xlab = "", ylab = "Student Response")

A paired-samples t-test showed that, as hypothesized, participants reported lower levels of metagenomics data collection/processing knowledge before completing the intervention (M = 0.450, SD = 0.605) than after (M = 3.85, SD = 0.988), Mdiff = 3.40, t(20) = 13.1, p < .0001. The effect size is very large (Cohen’s d = 4.15).

c. Metagenomics analysis

#I understand how metagenomics data are analyzed    

pre_metagenomics_analyzed <- pre$I.understand.how.metagenomics.data.are.analyzed
post_metagenomics_analyzed <- post$I.understand.how.metagenomics.data.are.analyzed

summary(pre_metagenomics_analyzed)
summary(post_metagenomics_analyzed)

mean(post_metagenomics_analyzed) - mean(pre_metagenomics_analyzed)

sd(pre_metagenomics_analyzed)
sd(post_metagenomics_analyzed)

t.test(post_metagenomics_analyzed, pre_metagenomics_analyzed)

cohen.d(post_metagenomics_analyzed, pre_metagenomics_analyzed)

metagenomics_analyzed <- data.frame(Pre_VTP = pre_metagenomics_analyzed, Post_VTP = post_metagenomics_analyzed)
ggpaired(metagenomics_analyzed, cond1 = "Pre_VTP", cond2 = "Post_VTP",
         color = "condition", fill = NULL, line.color = "black", palette = 'ucscgb', title = "I understand how metagenomics data are analyzed", xlab = "", ylab = "Student Response")

A paired-samples t-test showed that, as hypothesized, participants reported lower levels of metagenomics analysis knowledge before completing the intervention (M = 0.400, SD = 0.598) than after (M = 3.65, SD = 1.27), Mdiff = 3.25, t(20) = 10.4, p < .0001. The effect size is very large (Cohen’s d = 3.28).

d. Linux/Bash terminal

#I understand how to use the Linux/Bash terminal    

pre_linux_bash <- pre$I.understand.how.to.use.the.Linux_Bash.terminal
post_linux_bash <- post$I.understand.how.to.use.the.Linux_Bash.terminal

summary(pre_linux_bash)
summary(post_linux_bash)

mean(post_linux_bash) - mean(pre_linux_bash)

sd(pre_linux_bash)
sd(post_linux_bash)

t.test(post_linux_bash, pre_linux_bash)

cohen.d(post_linux_bash, pre_linux_bash)


linux_bash <- data.frame(Pre_VTP = pre_linux_bash, Post_VTP = post_linux_bash)
ggpaired(linux_bash, cond1 = "Pre_VTP", cond2 = "Post_VTP",
         color = "condition", fill = NULL, line.color = "black", palette = 'ucscgb', title = "I understand how to use the Linux/Bash terminal", xlab = "", ylab = "Student Response")

A paired-samples t-test showed that, as hypothesized, participants reported lower levels of Linux/Bash terminal knowledge before completing the intervention (M = 0.450, SD = 1.28) than after (M = 3.65, SD = 1.42), Mdiff = 3.20, t(20) = 7.48, p < .0001. The effect size is very large (Cohen’s d = 2.37).

e. R/RStudio

# Survey item: I understand how to use R/RStudio

pre_r <- pre$I.understand.how.to.use.R_RStudio
post_r <- post$I.understand.how.to.use.R_RStudio

summary(pre_r)
summary(post_r)

mean(post_r) - mean(pre_r)

sd(pre_r)
sd(post_r)

t.test(post_r, pre_r)

cohen.d(post_r, pre_r)

r <- data.frame(Pre_VTP = pre_r, Post_VTP = post_r)
ggpaired(r, cond1 = "Pre_VTP", cond2 = "Post_VTP",
         color = "condition", fill = NULL, line.color = "black", palette = 'ucscgb', title = "I understand how to use R/RStudio", xlab = "", ylab = "Student Response")

A paired-samples t-test showed that, as hypothesized, participants reported lower levels of R/RStudio knowledge before completing the intervention (M = 0.350, SD = 0.8123) than after (M = 3.80, SD = 1.24), Mdiff = 3.45, t(20) = 10.4, p < .0001. The effect size is very large (Cohen’s d = 3.29).

f. Bioinformatics applications

#I understand how bioinformatics tools can be used to answer a scientific question

pre_scientific_question <- pre$I.understand.how.bioinformatics.tools.can.be.used.to.answer.a.scientific.question
post_scientific_question <- post$I.understand.how.bioinformatics.tools.can.be.used.to.answer.a.scientific.question

summary(pre_scientific_question)
summary(post_scientific_question)

mean(post_scientific_question) - mean(pre_scientific_question)

sd(pre_scientific_question)
sd(post_scientific_question)

t.test(post_scientific_question, pre_scientific_question)

cohen.d(post_scientific_question, pre_scientific_question)

scientific_question <- data.frame(Pre_VTP = pre_scientific_question, Post_VTP = post_scientific_question)
ggpaired(scientific_question, cond1 = "Pre_VTP", cond2 = "Post_VTP",
         color = "condition", fill = NULL, line.color = "black", palette = 'ucscgb', title = "I understand how bioinformatics tools can help answer a scientific question", xlab = "", ylab = "Student Response")

A paired-samples t-test showed that, as hypothesized, participants reported lower levels of bioinformatics application knowledge before completing the intervention (M = 1.25, SD = 1.25) than after (M = 3.95, SD = 1.19), Mdiff = 2.7, t(20) = 6.99, p < .0001. The effect size is very large (Cohen’s d = 2.21).
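The effect sizes in a through f can be cross-checked from the reported means and standard deviations alone. The sketch below assumes they were computed as Cohen's d with a pooled standard deviation for two equally sized groups, d = (M_post - M_pre) / sqrt((SD_pre^2 + SD_post^2) / 2), which is what effsize's cohen.d() computes when paired is left at its default. The data frame and column names are illustrative.

```r
# Means, SDs, and effect sizes as reported in results a-f above
reported <- data.frame(
  measure = c("genomics data format", "metagenomics collection/processing",
              "metagenomics analysis", "Linux/Bash terminal",
              "R/RStudio", "bioinformatics applications"),
  m_pre      = c(0.550, 0.450, 0.400, 0.450, 0.350, 1.25),
  sd_pre     = c(0.999, 0.605, 0.598, 1.28, 0.8123, 1.25),
  m_post     = c(3.55, 3.85, 3.65, 3.65, 3.80, 3.95),
  sd_post    = c(1.23, 0.988, 1.27, 1.42, 1.24, 1.19),
  d_reported = c(2.67, 4.15, 3.28, 2.37, 3.29, 2.21)
)

# Pooled SD for two equally sized groups, then Cohen's d
sd_pooled <- sqrt((reported$sd_pre^2 + reported$sd_post^2) / 2)
reported$d_recomputed <- (reported$m_post - reported$m_pre) / sd_pooled

# Recomputed values should match the reported ones up to rounding of the inputs
stopifnot(all(abs(reported$d_recomputed - reported$d_reported) < 0.02))

print(data.frame(measure = reported$measure,
                 d_recomputed = round(reported$d_recomputed, 2)))
```

Because the inputs are rounded to two or three significant figures, the recomputed values can differ from the reported ones in the last decimal place.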

8. Conclusion

To support the development of an inclusive scientific enterprise, it is important to provide computational biology experiences for low-income and underrepresented minority high school and college students and to train them in emergent skill sets like data science and navigating novel data sets. A workforce trained in these competencies will be more equipped to address 21st century problems, such as pandemic preparedness, widening gaps in income, disparate educational opportunities, and insufficient access to healthcare.