Executive Summary

In November 2020, MILRD and partner organization MindsOf Initiative were awarded a grant from the Illumina Corporate Foundation to enroll 50 aspiring scientists at the high school and college levels from underrepresented backgrounds in MILRD’s Virtual Training Projects (VTPs). MILRD and MindsOf completed the project in January 2022. All told, the Illumina Corporate Foundation funds fully supported 55 scholarship students.

MILRD/MindsOf students completed one of four VTPs:

Metagenomics + Microbial Surveillance (Collaboration with Mason Lab, Cornell Medical Center)
Single-cell Transcriptomics + Lung Cell Characterization (Collaboration with Dr. Martina Bradic, Memorial Sloan Kettering Cancer Center)
Variant Calling + COVID-19 (Collaboration with Dr. Adriana Heguy, NYU Medical Center)
Single-cell Transcriptomics + Habenula Neuron Characterization (Collaboration with Dr. Mike Wallace, Sabatini Lab at Harvard Medical School)

Each weeklong VTP requires 15-20 hours of independent work for undergraduates and high-schoolers, in addition to the three scheduled hour-long meetings.

VTPs include:

Unlimited support from expert mentors (group channel + video conference calls)
Access to all required high-performance compute resources (AWS), analysis tools and software
Access to all source data required to complete the project
Optional Pre-VTP Preparation (completed the week prior to the VTP)

Participant Breakdown

library(flextable)
library(magrittr)
df <- data.frame(Category = c("Male Students / Female Students*", "High Schoolers / Undergraduates"), Percentage = c("25% / 75%", "27% / 73%"))
# Use a distinct name so the object does not mask base R's table() function
table_1 <- df %>% regulartable() %>% autofit()
table_1 <- bold(table_1, bold = TRUE, part = "header")
table_1

Category	Percentage
Male Students / Female Students*	25% / 75%
High Schoolers / Undergraduates	27% / 73%

*More female than male students took part in the VTPs, which is consistent with the broader trend in online learning in Jamaica since the pandemic. Data presented by the nonprofit Caribbean Girls Hack indicate that more male students than female students have been lost in the move to online learning.

library(flextable)
library(magrittr)
df_2 <- data.frame(VTP = c("Metagenomics + Microbial Surveillance", "Single-cell Transcriptomics + Lung Cell Characterization", "Variant Calling + COVID-19", "Single-cell Transcriptomics + Habenula Neuron Characterization"), Percentage = c("55% (30)", "27% (15)", "13% (7)", "5% (3)"))
table_2 <- df_2 %>% regulartable() 
table_2 <- bold(table_2, bold = TRUE, part = "header")
table_2 <- set_header_labels(table_2, VTP = "VTP", Percentage = "Percentage (# Participants)")%>% autofit() 
table_2

VTP	Percentage (# Participants)
Metagenomics + Microbial Surveillance	55% (30)
Single-cell Transcriptomics + Lung Cell Characterization	27% (15)
Variant Calling + COVID-19	13% (7)
Single-cell Transcriptomics + Habenula Neuron Characterization	5% (3)
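The percentages in the table above can be derived from the participant counts rather than typed by hand. A minimal sketch in base R follows; the names `counts`, `pct`, and `labels` are illustrative assumptions, not part of the original analysis.

```r
# Participant counts per VTP, taken from the table above
counts <- c(
  "Metagenomics + Microbial Surveillance" = 30,
  "Single-cell Transcriptomics + Lung Cell Characterization" = 15,
  "Variant Calling + COVID-19" = 7,
  "Single-cell Transcriptomics + Habenula Neuron Characterization" = 3
)

# Share of all scholarship students, rounded to whole percentages
pct <- round(100 * counts / sum(counts))

# Labels in the same "percentage (count)" style used in the table
labels <- sprintf("%d%% (%d)", as.integer(pct), as.integer(counts))
data.frame(VTP = names(counts), Percentage = labels, row.names = NULL)
```

Rounded percentages will not always sum to exactly 100, so the counts remain the authoritative figures.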

Mentors

Paul Scheid (Founder, MILRD)
Camir Ricketts, PhD (Founder, MindsOf | Bioinformatics Scientist, NVIDIA)
Martina Bradic, PhD (Senior Computational Biologist, Memorial Sloan Kettering Cancer Center)
LaShanda R. Williams, PhD (Postdoctoral Fellow, Albert Einstein College of Medicine)
Michael Pitter (PhD Candidate, University of Michigan)
A. Brayan Campos-Salazar (PhD Candidate, Duke University)
Salwa Lin (PhD Candidate, Oxford University)
Olivia Goldman (PhD Candidate, Rockefeller University)

Conclusions

The project was successful overall. All participants reported they would recommend the VTPs to other students. A majority (> 85% of the students) requested to participate in 2-3 week extension VTPs. Many (> 20% of the students) have served as Assistant Mentors to new cohorts of the VTPs they previously completed and, in multiple cases, helped participants who had considerably more education or experience than they did.

An impact assessment of a subset of MILRD/MindsOf students showed substantial increases in self-reported knowledge across all assessed categories following VTP completion.

In 2022, we plan to (1) enroll whole classes at high schools and colleges/universities, for which we already have multiple commitments, and (2) enroll more teachers in VTPs for Professional Development/CTLE credit to promote integration of MILRD’s VTPs, and subsets of VTPs, into their courses.

Impact Assessment

Overview

To understand the impact of our VTPs, we conducted a pre/post self-efficacy study with 20 students from the MILRD/MindsOf project who completed the Metagenomics + Microbial Surveillance VTP between January 1, 2022 and January 31, 2022. We selected this VTP because the greatest number of students completed it.

Variables and Measures

This study assessed the Metagenomics + Microbial Surveillance VTP as an intervention to increase knowledge of: genomics data format/structure, metagenomics sample processing & analysis, Linux/Bash terminal use, R/RStudio use, and applications of bioinformatics tools.

The study design is a within-subjects (pre-post) design where we assess relevant dependent measures immediately before and immediately after workshop participation. There is no control group.
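In a within-subjects design like this one, each student's pre and post ratings form a matched pair, which is the structure a paired t-test analyzes. The sketch below uses base R only; the ratings are made-up illustrative values, not study data, and all variable names are assumptions.

```r
# Hypothetical pre/post ratings for six students (NOT study data);
# position i in each vector belongs to the same student.
pre_rating  <- c(0, 1, 0, 2, 1, 0)
post_rating <- c(3, 4, 2, 5, 4, 3)

# Paired t-test: tests whether the mean within-student change is zero
paired_result <- t.test(post_rating, pre_rating, paired = TRUE)

# Equivalent formulation: one-sample t-test on the per-student differences
change <- post_rating - pre_rating
diff_result <- t.test(change, mu = 0)

# Effect size for paired data (d_z): mean change / SD of the changes
d_z <- mean(change) / sd(change)

paired_result$statistic   # t statistic
paired_result$parameter   # degrees of freedom = number of pairs - 1
d_z
```

With 20 matched students, a paired test of this kind would have 19 degrees of freedom.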

Of the 20 MILRD/MindsOf students in this study, 17 identified as ‘Black or African American’ and 3 identified as ‘Asian or Pacific Islander’. No students indicated they were of Hispanic or Latino descent.

The group was neither large nor varied enough to assess gender and education level as covariates:

library(flextable)
library(magrittr)
df_3 <- data.frame(Category = c("Gender*", "Education Level"), Participants = c("18: identified as 'Female'; 1: identified as 'Male'; 1: not reported", "17: 'undergraduate'; 3: 'high school'"))
table_3 <- df_3 %>% regulartable() 
table_3 <- bold(table_3, bold = TRUE, part = "header")
# Header keys must match the data frame's column names (Category, Participants)
table_3 <- set_header_labels(table_3, Category = "Category", Participants = "Participants") %>% autofit()
table_3

Category	Participants
Gender*	18: identified as 'Female'; 1: identified as 'Male'; 1: not reported
Education Level	17: 'undergraduate'; 3: 'high school'

*Participants were presented with a blank text box when asked to report their gender.

Students were asked to rate their knowledge level for each of these statements on a whole-number scale from 0 (None) to 6 (Expert):

I understand how genomics data are structured/formatted
I understand how metagenomics data are collected and processed
I understand how metagenomics data are analyzed
I understand how to use the Linux/Bash terminal
I understand how to use R/RStudio
I understand how bioinformatics tools can be used to answer a scientific question

Independent Variable: Time of Assessment (pre, post)

Dependent Measures: knowledge of (1) genomics data format/structure, (2) metagenomics sample processing, (3) metagenomics analysis, (4) Linux/Bash terminal use, (5) R/RStudio use, and (6) applications of bioinformatics tools.

Results

Overall, students reported substantial increases in knowledge across all assessed categories following VTP completion: (a.) genomics data format knowledge (Cohen’s d = 2.67), (b.) metagenomics data collection/processing knowledge (Cohen’s d = 4.15), (c.) metagenomics analysis knowledge (Cohen’s d = 3.28), (d.) Linux/Bash terminal knowledge (Cohen’s d = 2.37), (e.) R/RStudio knowledge (Cohen’s d = 3.29), (f.) bioinformatics application knowledge (Cohen’s d = 2.21).

Please note: because students replied to the survey with whole-number answers, some of the lines that connect pre/post responses lie on top of each other, so 20 distinct lines are not always visible.
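Because responses are whole numbers, several students can share the same (pre, post) pair, and their connecting lines coincide. One way to see how many distinct lines a paired plot can show is to tabulate the pairs. The ratings and variable names below are hypothetical, not study data.

```r
# Hypothetical whole-number ratings for seven students (NOT study data)
pre_rating  <- c(0, 0, 1, 0, 2, 0, 1)
post_rating <- c(3, 3, 4, 3, 4, 4, 4)

# One row per student, plus a counter column for aggregation
responses <- data.frame(pre = pre_rating, post = post_rating, n = 1)

# Count how many students share each (pre, post) combination
pair_counts <- aggregate(n ~ pre + post, data = responses, FUN = sum)

# Number of distinct lines that would be visible in the paired plot
n_distinct_lines <- nrow(pair_counts)

pair_counts
n_distinct_lines
```

Any row of `pair_counts` with `n` greater than 1 corresponds to overlapping lines in the plot.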

a. Genomics data format

#library(vioplot)
library(ggpubr)
#library(ggplot2)
library(effsize)


pre = read.csv('pre_metagenomics_for_csv.csv', header=TRUE, sep=',')

post = read.csv('post_metagenomics_for_csv.csv', header=TRUE, sep=',')

#I understand how genomics data are structured_formatted    

pre_genomics_data <- pre$I.understand.how.genomics.data.are.structured_formatted
post_genomics_data <- post$I.understand.how.genomics.data.are.structured_formatted

summary(pre_genomics_data)
summary(post_genomics_data)

mean(post_genomics_data) - mean(pre_genomics_data)

sd(pre_genomics_data)
sd(post_genomics_data)

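# Note: without paired = TRUE, t.test() runs a Welch two-sample test,
# treating the pre and post responses as independent samples.
t.test(post_genomics_data, pre_genomics_data)

# cohen.d() likewise defaults to the unpaired, pooled-SD effect size.
cohen.d(post_genomics_data, pre_genomics_data)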

genomics_data <- data.frame(Pre_VTP = pre_genomics_data, Post_VTP = post_genomics_data)
ggpaired(genomics_data, cond1 = "Pre_VTP", cond2 = "Post_VTP",
         color = "condition", fill = NULL, line.color = "black", palette = 'ucscgb', title = "I understand how genomics data are structured/formatted", xlab = " ", ylab = "Student Response")

A Welch two-sample t-test (the default for the t.test() call above, which does not set paired = TRUE) showed that, as hypothesized, participants reported lower levels of genomics data format knowledge before completing the intervention (M = 0.550, SD = 0.999) than after (M = 3.55, SD = 1.23), M_diff = 3.00, t = 8.45, p < .0001 (n = 20 at each time point). The effect size is very large (Cohen’s d = 2.67).
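The reported effect size can be approximately reproduced from the summary statistics alone. The sketch below assumes the pooled-standard-deviation form of Cohen's d for two equal-sized groups, which is the unpaired default of cohen.d() in the effsize package; the helper name `cohens_d_pooled` is illustrative. The inputs are the rounded values quoted above, so the result matches the reported figures only to within rounding.

```r
# Cohen's d from summary statistics, using a pooled SD and assuming
# equal group sizes (an assumption of this sketch)
cohens_d_pooled <- function(mean_pre, sd_pre, mean_post, sd_post) {
  pooled_sd <- sqrt((sd_pre^2 + sd_post^2) / 2)
  (mean_post - mean_pre) / pooled_sd
}

# Rounded values reported for genomics data format knowledge
d <- cohens_d_pooled(mean_pre = 0.550, sd_pre = 0.999,
                     mean_post = 3.55, sd_post = 1.23)

# For two groups of equal size n, the two-sample t statistic
# equals d * sqrt(n / 2)
n <- 20
t_stat <- d * sqrt(n / 2)

round(d, 2)
round(t_stat, 2)
```

The same relationship between d and t holds for the other five measures, which is a quick consistency check on the reported statistics.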

b. Metagenomics data collection

#I understand how metagenomics data are collected and processed

pre_metagenomics_collected <- pre$I.understand.how.metagenomics.data.are.collected.and.processed
post_metagenomics_collected <- post$I.understand.how.metagenomics.data.are.collected.and.processed

summary(pre_metagenomics_collected )
summary(post_metagenomics_collected)

mean(post_metagenomics_collected) - mean(pre_metagenomics_collected )

sd(pre_metagenomics_collected )
sd(post_metagenomics_collected)

t.test(post_metagenomics_collected, pre_metagenomics_collected )

cohen.d(post_metagenomics_collected, pre_metagenomics_collected )


metagenomics_collected <- data.frame(Pre_VTP = pre_metagenomics_collected , Post_VTP = post_metagenomics_collected)
ggpaired(metagenomics_collected, cond1 = "Pre_VTP", cond2 = "Post_VTP",
         color = "condition", fill = NULL, line.color = "black", palette = 'ucscgb', title = "I understand how metagenomics data are collected and processed", xlab = "", ylab = "Student Response")

A Welch two-sample t-test (the default for the t.test() call above, which does not set paired = TRUE) showed that, as hypothesized, participants reported lower levels of metagenomics data collection/processing knowledge before completing the intervention (M = 0.450, SD = 0.605) than after (M = 3.85, SD = 0.988), M_diff = 3.40, t = 13.1, p < .0001 (n = 20 at each time point). The effect size is very large (Cohen’s d = 4.15).

c. Metagenomics analysis

#I understand how metagenomics data are analyzed    

pre_metagenomics_analyzed <- pre$I.understand.how.metagenomics.data.are.analyzed
post_metagenomics_analyzed  <- post$I.understand.how.metagenomics.data.are.analyzed

summary(pre_metagenomics_analyzed)
summary(post_metagenomics_analyzed)

mean(post_metagenomics_analyzed) - mean(pre_metagenomics_analyzed)

sd(pre_metagenomics_analyzed)
sd(post_metagenomics_analyzed)

t.test(post_metagenomics_analyzed, pre_metagenomics_analyzed)

cohen.d(post_metagenomics_analyzed, pre_metagenomics_analyzed)

metagenomics_analyzed <- data.frame(Pre_VTP = pre_metagenomics_analyzed , Post_VTP = post_metagenomics_analyzed)
ggpaired(metagenomics_analyzed, cond1 = "Pre_VTP", cond2 = "Post_VTP",
         color = "condition", fill = NULL, line.color = "black", palette = 'ucscgb', title = "I understand how metagenomics data are analyzed", xlab = "", ylab = "Student Response")

A Welch two-sample t-test (the default for the t.test() call above, which does not set paired = TRUE) showed that, as hypothesized, participants reported lower levels of metagenomics analysis knowledge before completing the intervention (M = 0.400, SD = 0.598) than after (M = 3.65, SD = 1.27), M_diff = 3.25, t = 10.4, p < .0001 (n = 20 at each time point). The effect size is very large (Cohen’s d = 3.28).

d. Linux/Bash terminal

#I understand how to use the Linux/Bash terminal    

pre_linux_bash <- pre$I.understand.how.to.use.the.Linux_Bash.terminal
post_linux_bash  <- post$I.understand.how.to.use.the.Linux_Bash.terminal

summary(pre_linux_bash)
summary(post_linux_bash)

mean(post_linux_bash) - mean(pre_linux_bash)

sd(pre_linux_bash)
sd(post_linux_bash)

t.test(post_linux_bash, pre_linux_bash)

cohen.d(post_linux_bash, pre_linux_bash)


linux_bash <- data.frame(Pre_VTP = pre_linux_bash, Post_VTP = post_linux_bash)
ggpaired(linux_bash, cond1 = "Pre_VTP", cond2 = "Post_VTP",
         color = "condition", fill = NULL, line.color = "black", palette = 'ucscgb', title = "I understand how to use the Linux/Bash terminal", xlab = "", ylab = "Student Response")

A Welch two-sample t-test (the default for the t.test() call above, which does not set paired = TRUE) showed that, as hypothesized, participants reported lower levels of Linux/Bash terminal knowledge before completing the intervention (M = 0.450, SD = 1.28) than after (M = 3.65, SD = 1.42), M_diff = 3.20, t = 7.48, p < .0001 (n = 20 at each time point). The effect size is very large (Cohen’s d = 2.37).

e. R/RStudio

#I understand how to use R programming language and R_RStudio

pre_r <- pre$I.understand.how.to.use.R_RStudio
post_r <- post$I.understand.how.to.use.R_RStudio

summary(pre_r)
summary(post_r)

mean(post_r) - mean(pre_r)

sd(pre_r)
sd(post_r)

t.test(post_r, pre_r)

cohen.d(post_r, pre_r)

r <- data.frame(Pre_VTP = pre_r, Post_VTP = post_r)
ggpaired(r, cond1 = "Pre_VTP", cond2 = "Post_VTP",
         color = "condition", fill = NULL, line.color = "black", palette = 'ucscgb', title = "I understand how to use R/RStudio", xlab = "", ylab = "Student Response")

A Welch two-sample t-test (the default for the t.test() call above, which does not set paired = TRUE) showed that, as hypothesized, participants reported lower levels of R/RStudio knowledge before completing the intervention (M = 0.350, SD = 0.812) than after (M = 3.80, SD = 1.24), M_diff = 3.45, t = 10.4, p < .0001 (n = 20 at each time point). The effect size is very large (Cohen’s d = 3.29).

f. Bioinformatics applications

#I understand how bioinformatics tools can be used to answer a scientific question

pre_scientific_question <- pre$I.understand.how.bioinformatics.tools.can.be.used.to.answer.a.scientific.question
post_scientific_question <- post$I.understand.how.bioinformatics.tools.can.be.used.to.answer.a.scientific.question

summary(pre_scientific_question)
summary(post_scientific_question)

mean(post_scientific_question) - mean(pre_scientific_question)

sd(pre_scientific_question)
sd(post_scientific_question)

t.test(post_scientific_question, pre_scientific_question)

cohen.d(post_scientific_question, pre_scientific_question)

scientific_question <- data.frame(Pre_VTP = pre_scientific_question, Post_VTP = post_scientific_question)
ggpaired(scientific_question, cond1 = "Pre_VTP", cond2 = "Post_VTP",
         color = "condition", fill = NULL, line.color = "black", palette = 'ucscgb', title = "I understand how bioinformatics tools can help answer a scientific question", xlab = "", ylab = "Student Response")

A Welch two-sample t-test (the default for the t.test() call above, which does not set paired = TRUE) showed that, as hypothesized, participants reported lower levels of bioinformatics application knowledge before completing the intervention (M = 1.25, SD = 1.25) than after (M = 3.95, SD = 1.19), M_diff = 2.70, t = 6.99, p < .0001 (n = 20 at each time point). The effect size is very large (Cohen’s d = 2.21).

Student Feedback

Overview

Participants had positive experiences overall.

All participants reported they would recommend the VTPs to other students.
A majority (> 85% of the students) requested to participate in 2-3 week extension VTPs, if/when funds become available. VTP Extension Projects allow participants to dive deeper into a subject and its analysis methods after completing the 1-week VTP. In Extension Projects, participants complete a customized and independent research project with that VTP’s dataset and present a capstone to an established domain expert who can assess the work and provide an evaluation that can be used for a recommendation letter if desired.
Many (> 20% of the students) have served as Assistant Mentors (AMs) to new cohorts of the VTPs they previously completed and, in multiple cases, helped participants who had considerably more education or experience than they did. The AM role is similar to that of a TA. We ask AMs for a 4-hour time commitment throughout the week, and they help new participants address the issues they themselves encountered during their cohort (AMs are not expected to be experts or to address questions they are not comfortable answering). Assistant Mentoring helps participants gain a deeper understanding of the material and interact with more researchers from academia and industry. MILRD pays an honorarium to each AM for their efforts.

Student Spotlight: Gabrielle Burke

Gabrielle is a biology major at the University of the West Indies, Mona in Kingston, Jamaica. She completed the Single-cell Transcriptomics + Habenula Neuron Characterization VTP. After Gabrielle completed her first VTP, she reached out to MindsOf founder Dr. Camir Ricketts for more advice on how she could pursue a career in computational biology (see email below).

Here are Gabrielle’s words about her experience in the MILRD VTP program:

“This [Virtual Training Project] was unquestionably beneficial to my development as a scientist. I believe that I left the experience a more keen, and patient person….

…My favorite part of the [project] was learning how to interact with the BASH terminal, inputting the lines of code and seeing the results at the end. I like how I got to see the data become transformed as each checkpoint was reached.

I thought that I had to choose between my love for programming and biology/chemistry, I did not know that there was a career that combined them both. I had an amazing experience and I am glad I got an opportunity to become exposed to this career possibility.”

Email Ms. Burke sent to Dr. Camir Ricketts (MindsOf founder):

Student Spotlight: Anthony Givans

Anthony completed the Variant Calling + COVID-19 VTP as a high school student in Jamaica. He is currently a first-year student at the University of Miami, double majoring in Finance and Mathematics with a minor in Computer Science.

Here are Anthony’s words about his experience in the MILRD VTP program:

“This [VTP] program was so beneficial to my development. As a prospective Data Scientist, I got the opportunity to work in an RStudio environment and get more acquainted with R. It has definitely helped me to solidify my knowledge of the programming language. One of my favourite parts was interacting with the mentors, along with the other students doing the program with me. It was a unique opportunity to interact with persons more knowledgeable than me without feeling left out… The VTP gave me unprecedented access to research materials and undoubtedly developed my data and computational skills. I would recommend this VTP to anyone who is interested in a data related field.

…I was surrounded (virtually) with extremely smart individuals, some with PhDs, who made me feel welcome and excited to tackle the problems at hand. It was by no means easy, but by collaborating with the group and working together to fix bugs, we were successful in the end…. As a recent self-taught programmer at the time, I had never worked with such a large code base. That said, I was able to learn a few tips and tricks on how to structure and write clean and effective code, which has been helping me ever since.

It has helped me so much that I have gone on to do many exciting things with code and I even decided to tack on a Computer Science minor to my already strenuous course load…. All in all, my participation in the program helped me a lot and I am very grateful.”

Future Directions

In 2022, we plan to: (1) scale our programs more quickly by enrolling whole classes at high schools and colleges/universities, and already have many commitments; (2) enroll more teachers in VTPs for Professional Development/CTLE credit to promote integration of MILRD’s VTPs into the courses they are teaching; and (3) experiment with launching subsets of VTPs so teachers can integrate bioinformatics/computational biology exercises into their courses, even if they are unable to commit 1-2 weeks to a full VTP.

Per (1), one of MILRD’s focuses in 2021 was to seek out partnerships with schools and education organizations so that we could more easily enroll large student cohorts moving forward. This effort really started to bear fruit in Q4 of 2021 and is the result of two notable initiatives.

First, MILRD presented a VTP workshop, and was a non-profit exhibitor, at the annual conference of the Science Teacher Association of New York State (STANYS), New York’s oldest professional organization of Pre-K to University public-school science educators.
Second, we integrated one of our VTPs into a SUNY Binghamton Undergraduate Microbiology Course with 80 students.

Both of these initiatives have led to partnerships with schools and education organizations.

With regard to continuing the MindsOf/MILRD project, we plan to enroll 75 independent students this coming year and have confirmed two course integrations with the University of the West Indies, provided we procure the funds to support them:

library(flextable)
library(magrittr)
df_3 <- data.frame(Course = c("UWI BIOL1018 Molecular Biology & Genetics (Fall Semester 2022)", "UWI ZOOL3410 Advanced Topics in Animal Science (Spring Semester, January 2023)"), Student_Enrollment = c(340, 30))
table_3 <- df_3 %>% regulartable() 
table_3 <- bold(table_3, bold = TRUE, part = "header")
table_3 <- set_header_labels(table_3, Course = "Course", Student_Enrollment = "Student Enrollment")%>% autofit() 
table_3

Course	Student Enrollment
UWI BIOL1018 Molecular Biology & Genetics (Fall Semester 2022)	340
UWI ZOOL3410 Advanced Topics in Animal Science (Spring Semester, January 2023)	30

Outside of our collaboration with MindsOf, MILRD has confirmed the following course integrations with schools and education organizations, provided we procure the funding to support them:

library(flextable)
library(magrittr)
df_4 <- data.frame(org = c("Hamutay Young Peruvian Scientist Network for Undergraduate Researchers",    "NYS North Rockland High School", "NYS Saugerties High School", "PhilaSD Lankenau Environmental Science Magnet High School  (via Steppingstone Scholars)", "Science Teacher Association of New York State", "PhilaSD G.W. Carver HS for Engineering & Science (via Steppingstone Scholars)", "SUNY Binghamton Cellular/Molecular Biology Lab Course"), Student_Enrollment = c(53, 67, 40, 35, 35, 55, 85), confirmed_dates = c("Mar 23 - Apr 1", "April 4-15", "April 18-29", "May (dates TBD)", "May (dates TBD)", "May 30 – Jun 10", "Nov 7 – Nov 18" ))
table_4 <- df_4 %>% regulartable() 
table_4 <- bold(table_4, bold = TRUE, part = "header")
table_4 <- set_header_labels(table_4, org = "School / Organization", Student_Enrollment = "Student Enrollment", confirmed_dates = "Confirmed Dates" )%>% autofit() 
table_4

School / Organization	Student Enrollment	Confirmed Dates
Hamutay Young Peruvian Scientist Network for Undergraduate Researchers	53	Mar 23 - Apr 1
NYS North Rockland High School	67	April 4-15
NYS Saugerties High School	40	April 18-29
PhilaSD Lankenau Environmental Science Magnet High School (via Steppingstone Scholars)	35	May (dates TBD)
Science Teacher Association of New York State	35	May (dates TBD)
PhilaSD G.W. Carver HS for Engineering & Science (via Steppingstone Scholars)	55	May 30 – Jun 10
SUNY Binghamton Cellular/Molecular Biology Lab Course	85	Nov 7 – Nov 18

Per (2) and (3), MILRD recently completed its first teacher professional development (PD) event for CTLE credit in collaboration with STANYS (8 hours CTLE credit, Tuesdays and Thursdays, Feb 1-10, from 6:30 pm to 8:30 pm) with 9 teachers. We plan to run more PD events with STANYS and are actively recruiting other organizations and schools to partner with for these offerings.

The STANYS PD event highlighted that some teachers want to integrate bioinformatics/data science into their bio courses, but that the 1-2 week commitment for a full VTP isn’t feasible for the courses they teach. We’re working to offer VTP subsets so teachers can integrate smaller (0.5 - 1.5 hours total effort) bioinformatics/data science exercises into their courses.

Impact Report: Illumina Corporate Foundation + MILRD/MindsOf Initiative
