(1) Executive Summary

MILRD is an education organization that offers Virtual Training Projects (VTPs) in advanced bioinformatics and other data science techniques for professionals, trainees, and students (graduate to high-school level).

The purpose of today’s presentation is to provide an overview of our VTPs and discuss volunteer opportunities for Illumina staff scientists and bioinformaticians.

(2) VTP Overview

VTPs are 1-2 week training projects that teach bioinformatic pipelines for analyzing data in a specific subdomain, e.g., metagenomics. These projects are generally based on reproducing the results of peer-reviewed scientific literature, starting with raw data and working through to data visualization and inference. When integrated into a course, VTPs bring bioinformatic and data science competencies into the classroom environment.

MILRD Scholarship participants completed one of four VTPs:

Metagenomics + Microbial Surveillance (Collaboration with Mason Lab, Cornell Medical Center)
Single-cell Transcriptomics + Lung Cell Characterization (Collaboration with Dr. Martina Bradic, Memorial Sloan Kettering Cancer Center)
Variant Calling + COVID-19 (Collaboration with Dr. Adriana Heguy, NYU Medical Center)
Single-cell Transcriptomics + Habenula Neuron Characterization (Collaboration with Dr. Mike Wallace, Sabatini Lab at Harvard Medical School)

Each VTP takes 1-2 weeks and requires 15-20 hours of independent work (in addition to the three scheduled hour-long meetings).

Course integration VTPs include:

Unlimited support from expert mentors (group channel + video conference calls)
Access to all required high-performance compute resources (AWS), analysis tools and software
Access to all source data required to complete the project
Pre-VTP Preparation in the programming languages and environments students will use to complete assigned bioinformatics tasks (optional)
Post-VTP assistance to help students download/install datasets/software used in the VTPs on their local computers so they can continue analyses offline (optional)

Platform screenshot:

Participants are introduced to the project, given bioinformatics tasks, and assigned AWS instances and sample IDs on this platform.

Bifx instructions example:

Participants are shown a complete analysis with a single sample and then asked to execute the same set of steps on their sample.

AWS-hosted RStudio Instance:

Each participant is assigned their own AWS high-performance compute instance.

Group Slack Channel for posting results and asking questions:

We currently use Slack for group collaboration and are implementing our own group channel functionality.

Checkpoint Results:

Each set of VTP analysis tasks is broken down into Checkpoints (CPs).

CP Template:

Case Study Example: Metagenomics + Microbial Surveillance VTP

The Metagenomics + Microbial Surveillance VTP was created in collaboration with Mason Lab at Cornell Medical Center.

Throughout the VTP, each participant characterizes, quantifies, and visualizes microbial metagenomics data from sequenced swabs of public urban environments on their own AWS high-performance compute instance. In the Linux terminal, they perform genomic data quality control, genome alignment, taxonomic characterization, and abundance quantification; in R, they visualize results and conduct a principal component analysis. To conclude, they investigate their most abundant species and use the PATRIC database to consider how they would determine the strains of these species.
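The principal component analysis step can be illustrated with a minimal sketch in base R. The abundance table, the sample names, and the species names below are hypothetical placeholders and are not VTP data; the choice to center without scaling is also an assumption made for illustration.

```r
# Hypothetical relative-abundance table: rows are samples, columns are species.
# The values are made up for illustration and are not VTP data.
abundance <- matrix(
  c(0.50, 0.30, 0.20,
    0.45, 0.35, 0.20,
    0.10, 0.20, 0.70,
    0.15, 0.15, 0.70),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    paste0("sample_", 1:4),
    c("species_A", "species_B", "species_C")
  )
)

# Center each species column and run the PCA. Scaling is left off so that
# abundant species carry more weight; scale. = TRUE is the other common choice.
pca <- prcomp(abundance, center = TRUE, scale. = FALSE)

# Share of total variance captured by each principal component.
variance_explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(variance_explained, 3))

# Sample coordinates on the first two components, ready for plotting.
scores <- pca$x[, 1:2]
print(scores)
```

Samples with similar species profiles end up close together in `scores`, which is what a PCA plot of the first two components would show.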

Linux Steps R Steps Subset Component

Checkpoint Example: https://milrd.org/wp-content/uploads/2022/06/Screen-Shot-2022-06-07-at-2.26.46-PM.png

(3) Past Scholarship Participant Breakdown

Scholarship participants from underrepresented and/or socioeconomically disadvantaged backgrounds and high-need organizations are sourced via our collaborators, including Social Good Fund, MindsOf Initiative, Science Teachers Association of New York State, and the State University of New York.

In Q4 of 2020, MILRD received funds from the Illumina Corporate Foundation and multiple individual donors to enroll aspiring scientists at the high school and college levels from underrepresented backgrounds and teachers from high-need schools in MILRD’s Virtual Training Projects (VTPs). This was our first round of grant funding to support disadvantaged students. MILRD used these funds to support scholarship students and teachers through January 2022. All told, 170 scholarship participants were supported with these funds.

Category	Percentage
High Schoolers	18%
Undergraduates	74%
Teachers	8%

Education/Career Level

Category	Percentage
Male	42%
Female	58%

Gender¹

¹Participants are presented a blank text box when asked to report their gender.

Category	Percentage
White	42%
Black or African American	34%
Asian	7%
Hispanic or Latino	7%
Native Hawaiian or Other Pacific Islanders	2%
Two or More Races	1%
American Indian or Alaska Native	0%
Other	8%

Race & Ethnicity

VTP	Percentage (# Participants)
Metagenomics + Microbial Surveillance	67% (115)
Single-cell Transcriptomics + Lung Cell Characterization	13% (22)
Variant Calling + COVID-19	12% (20)
Single-cell Transcriptomics + Habenula Neuron Characterization	8% (13)

VTP Enrollment
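As a quick arithmetic check, the enrollment shares can be recomputed from the participant counts in the table. This is a minimal sketch; the variable names are illustrative. The counts add up to the 170 scholarship participants reported earlier, and the one-decimal shares can differ slightly from the whole-number percentages in the table because of rounding.

```r
# Participant counts per VTP, taken from the enrollment table above.
enrollment <- c(
  "Metagenomics + Microbial Surveillance" = 115,
  "Single-cell Transcriptomics + Lung Cell Characterization" = 22,
  "Variant Calling + COVID-19" = 20,
  "Single-cell Transcriptomics + Habenula Neuron Characterization" = 13
)

# Total participants and each VTP's share, rounded to one decimal place.
total <- sum(enrollment)
share <- round(100 * enrollment / total, 1)

print(total)
print(share)
```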

Mentors

Paul Scheid (Founder, MILRD)
Camir Ricketts, PhD (Founder, MindsOf | Bioinformatics Scientist, NVIDIA)
Martina Bradic, PhD (Senior Computational Biologist, Memorial Sloan Kettering Cancer Center)
LaShanda R. Williams, PhD (Postdoctoral Fellow, Albert Einstein College of Medicine)
Michael Pitter (PhD Candidate, University of Michigan)
A. Brayan Campos-Salazar (PhD Candidate, Duke University)
Salwa Lin (PhD Candidate, Oxford University)
Olivia Goldman (PhD Candidate, Rockefeller University)

Conclusions

The project was successful overall. All participants reported they would recommend the VTPs to other students. A majority (> 60% of the students) requested information on our 2-3 week extension VTPs. Many (> 15% of the students) have served as Assistant Mentors to new cohorts of the VTPs they previously completed, and in multiple cases, helped participants who had a lot more education/experience than themselves.

An impact assessment of MILRD scholarship undergraduate and high school students from the MILRD/MindsOf collaboration showed substantial increases in self-reported knowledge across all assessed categories following VTP completion.

In 2022, we plan to (1) enroll whole classes at high schools and colleges/universities, for which we have already secured multiple course integrations, and (2) enroll more teachers in VTPs for Professional Development/CTLE credit to promote integration of MILRD’s VTPs (and subsets of VTPs) into their courses.

(4) Impact Assessment

Overview

To understand the impact of our VTPs, we conducted a pre/post self-efficacy study with 20 students from the MILRD/MindsOf project who completed the Metagenomics + Microbial Surveillance VTP between January 1, 2022 and January 31, 2022. We selected this VTP because the greatest number of students completed it.

Variables and Measures

This study assessed the Metagenomics + Microbial Surveillance VTP as an intervention to increase knowledge of genomics data format/structure, metagenomics sample processing and analysis, Linux/Bash terminal use, R/RStudio use, and applications of bioinformatics tools.

The study design is a within-subjects (pre-post) design where we assess relevant dependent measures immediately before and immediately after workshop participation. There is no control group.
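For a within-subjects design, the natural unit of analysis is each participant's own change score. The sketch below uses base R with hypothetical ratings (not study data) to show how a paired test relates to that design; the vector names are illustrative.

```r
# Hypothetical pre/post ratings for five participants (not study data).
# Position i in each vector belongs to the same participant.
pre_scores  <- c(0, 1, 0, 2, 1)
post_scores <- c(3, 4, 2, 5, 3)

# A within-subjects analysis works on each participant's own change score.
change <- post_scores - pre_scores

# Paired t-test; equivalent to a one-sample t-test on the change scores.
paired_result <- t.test(post_scores, pre_scores, paired = TRUE)
one_sample_result <- t.test(change, mu = 0)

print(paired_result$statistic)
print(paired_result$parameter)  # degrees of freedom: number of pairs minus 1
```

Because there is no control group, a test like this describes the pre-to-post change but cannot by itself separate the effect of the VTP from other influences over the same period.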

Of the 20 MILRD/MindsOf students in this study, 17 identified as ‘Black or African American’ and 3 identified as ‘Asian or Pacific Islander’. No students indicated they were of Hispanic or Latino descent.

The group was neither large nor varied enough to assess gender and education level as covariates:

library(flextable)
library(magrittr)

# Gender and education-level breakdown of the 20 study participants
df_3 <- data.frame(
  Category = c("Gender*", "Education Level"),
  Participants = c("18: identified as 'Female'; 1: identified as 'Male'; 1: not reported",
                   "17: 'undergraduate'; 3: 'high school'")
)
# Build the table, bold the header row, label the columns, and fit widths
table_3 <- df_3 %>% flextable()
table_3 <- bold(table_3, bold = TRUE, part = "header")
table_3 <- set_header_labels(table_3, Category = "Category", Participants = "Participants") %>% autofit()
table_3

Category	Participants
Gender*	18: identified as 'Female'; 1: identified as 'Male'; 1: not reported
Education Level	17: 'undergraduate'; 3: 'high school'

*Participants are presented a blank text box when asked to report their gender.

Students were asked to rate their knowledge level for each of these statements on a whole-number scale from 0 (None) to 6 (Expert):

I understand how genomics data are structured/formatted
I understand how metagenomics data are collected and processed
I understand how metagenomics data are analyzed
I understand how to use the Linux/Bash terminal
I understand how to use R/RStudio
I understand how bioinformatics tools can be used to answer a scientific question
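Before analysis, responses on a whole-number scale like this one can be screened for out-of-range or non-integer entries. This is a minimal sketch in base R; the function name `is_valid_rating` and the example responses are hypothetical and not part of the study code.

```r
# Hypothetical responses to one survey item (not study data).
responses <- c(0, 2, 6, 3, 4.5, 7, NA)

# A response is valid when it is present, a whole number, and within 0..6.
is_valid_rating <- function(x, lower = 0, upper = 6) {
  !is.na(x) & x == round(x) & x >= lower & x <= upper
}

valid <- is_valid_rating(responses)
print(valid)

# Entries that would need follow-up before computing means or tests.
print(responses[!valid])
```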

Independent Variable: Time of Assessment (pre, post)

Dependent Measures: knowledge of (1) genomics data format/structure, (2) metagenomics sample processing, (3) metagenomics analysis, (4) Linux/Bash terminal use, (5) R/RStudio use, and (6) applications of bioinformatics tools.

Results

Overall, students reported substantial increases in knowledge across all assessed categories following VTP completion: (a.) genomics data format knowledge (Cohen’s d = 2.67), (b.) metagenomics data collection/processing knowledge (Cohen’s d = 4.15), (c.) metagenomics analysis knowledge (Cohen’s d = 3.28), (d.) Linux/Bash terminal knowledge (Cohen’s d = 2.37), (e.) R/RStudio knowledge (Cohen’s d = 3.29), (f.) bioinformatics application knowledge (Cohen’s d = 2.21).
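To make the effect sizes concrete, Cohen's d can be approximated from summary statistics alone. The sketch below assumes the pooled-standard-deviation form for two equal-sized groups, which is my understanding of the default in `effsize::cohen.d`; the helper name is hypothetical. It uses the rounded means and SDs reported for item (a), so the result only approximately matches the reported value.

```r
# Cohen's d for two equal-sized groups, using the pooled standard deviation.
cohens_d_from_summary <- function(mean_pre, sd_pre, mean_post, sd_post) {
  pooled_sd <- sqrt((sd_pre^2 + sd_post^2) / 2)
  (mean_post - mean_pre) / pooled_sd
}

# Rounded means and SDs as reported for genomics data format knowledge.
d_genomics <- cohens_d_from_summary(
  mean_pre = 0.550, sd_pre = 0.999,
  mean_post = 3.55, sd_post = 1.23
)
print(round(d_genomics, 2))
```

A d of this size means the post-VTP mean sits more than two pooled standard deviations above the pre-VTP mean.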

Please note: because students replied to the survey with whole-number answers, some of the lines that connect pre/post responses are on top of each other; thus 20 distinct lines aren’t always available.
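The overlap can be quantified by counting how many participants share each pre/post combination. This is a minimal sketch in base R with hypothetical ratings (not study data); the column names mirror those used in the plotting code.

```r
# Hypothetical pre/post whole-number ratings for six participants (not study data).
ratings <- data.frame(
  Pre_VTP  = c(0, 0, 1, 0, 1, 2),
  Post_VTP = c(3, 3, 4, 3, 4, 5)
)

# Count how many participants share each (pre, post) combination.
pair_counts <- aggregate(
  n ~ Pre_VTP + Post_VTP,
  data = cbind(ratings, n = 1),
  FUN = sum
)
print(pair_counts)

# Number of visually distinct lines a paired plot could show.
distinct_lines <- nrow(pair_counts)
print(distinct_lines)
```

Any row of `pair_counts` with `n` greater than 1 corresponds to lines drawn on top of each other in the paired plot.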

a. Genomics data format

library(ggpubr)   # ggpaired() for pre/post plots
library(effsize)  # cohen.d() for effect sizes

# Load the pre- and post-VTP survey responses
pre <- read.csv('pre_metagenomics_for_csv.csv', header = TRUE, sep = ',')
post <- read.csv('post_metagenomics_for_csv.csv', header = TRUE, sep = ',')

# Survey item: I understand how genomics data are structured/formatted
pre_genomics_data <- pre$I.understand.how.genomics.data.are.structured_formatted
post_genomics_data <- post$I.understand.how.genomics.data.are.structured_formatted

# Descriptive statistics
summary(pre_genomics_data)
summary(post_genomics_data)

mean(post_genomics_data) - mean(pre_genomics_data)

sd(pre_genomics_data)
sd(post_genomics_data)

# Note: t.test() defaults to Welch's two-sample test; a paired-samples test needs paired = TRUE
t.test(post_genomics_data, pre_genomics_data)

# Effect size (Cohen's d)
cohen.d(post_genomics_data, pre_genomics_data)

# Plot each student's pre and post response, connected by a line
genomics_data <- data.frame(Pre_VTP = pre_genomics_data, Post_VTP = post_genomics_data)
ggpaired(genomics_data, cond1 = "Pre_VTP", cond2 = "Post_VTP",
         color = "condition", fill = NULL, line.color = "black", palette = 'ucscgb', title = "I understand how genomics data are structured/formatted", xlab = " ", ylab = "Student Response")

A Welch two-sample t-test (the default for R’s t.test) showed that, as hypothesized, participants reported lower levels of genomics data format knowledge before completing the intervention (M = 0.550, SD = 0.999) than after (M = 3.55, SD = 1.23), M_diff = 3.0, t = 8.45, p < .0001. The effect size is very large (Cohen’s d = 2.67).
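The reported t value can be checked against the summary statistics. This is a minimal sketch, assuming the statistic comes from the `t.test(post, pre)` call shown above, which by default runs Welch's two-sample test; the helper name is hypothetical. Because the inputs are the rounded values from the text, the result only approximately reproduces the reported statistic.

```r
# Welch's t statistic and degrees of freedom from group summaries.
welch_from_summary <- function(mean_1, sd_1, n_1, mean_2, sd_2, n_2) {
  var_mean_1 <- sd_1^2 / n_1  # squared standard error of group 1's mean
  var_mean_2 <- sd_2^2 / n_2  # squared standard error of group 2's mean
  t_value <- (mean_1 - mean_2) / sqrt(var_mean_1 + var_mean_2)
  # Welch-Satterthwaite approximation for the degrees of freedom.
  df <- (var_mean_1 + var_mean_2)^2 /
    (var_mean_1^2 / (n_1 - 1) + var_mean_2^2 / (n_2 - 1))
  c(t = t_value, df = df)
}

# Rounded post and pre summaries reported for genomics data format knowledge.
welch_genomics <- welch_from_summary(
  mean_1 = 3.55, sd_1 = 1.23, n_1 = 20,
  mean_2 = 0.550, sd_2 = 0.999, n_2 = 20
)
print(round(welch_genomics, 2))
```

A paired-samples test on the same 20 students would instead use the standard deviation of the individual change scores and 19 degrees of freedom, which cannot be recovered from the group summaries alone.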

b. Metagenomics data collection

#I understand how metagenomics data are collected and processed

pre_metagenomics_collected <- pre$I.understand.how.metagenomics.data.are.collected.and.processed
post_metagenomics_collected <- post$I.understand.how.metagenomics.data.are.collected.and.processed

summary(pre_metagenomics_collected)
summary(post_metagenomics_collected)

mean(post_metagenomics_collected) - mean(pre_metagenomics_collected)

sd(pre_metagenomics_collected)
sd(post_metagenomics_collected)

t.test(post_metagenomics_collected, pre_metagenomics_collected)

cohen.d(post_metagenomics_collected, pre_metagenomics_collected)

metagenomics_collected <- data.frame(Pre_VTP = pre_metagenomics_collected, Post_VTP = post_metagenomics_collected)
ggpaired(metagenomics_collected, cond1 = "Pre_VTP", cond2 = "Post_VTP",
         color = "condition", fill = NULL, line.color = "black", palette = 'ucscgb', title = "I understand how metagenomics data are collected and processed", xlab = "", ylab = "Student Response")

A Welch two-sample t-test (the default for R’s t.test) showed that, as hypothesized, participants reported lower levels of metagenomics data collection/processing knowledge before completing the intervention (M = 0.450, SD = 0.605) than after (M = 3.85, SD = 0.988), M_diff = 3.40, t = 13.1, p < .0001. The effect size is very large (Cohen’s d = 4.15).

c. Metagenomics analysis

#I understand how metagenomics data are analyzed    

pre_metagenomics_analyzed <- pre$I.understand.how.metagenomics.data.are.analyzed
post_metagenomics_analyzed <- post$I.understand.how.metagenomics.data.are.analyzed

summary(pre_metagenomics_analyzed)
summary(post_metagenomics_analyzed)

mean(post_metagenomics_analyzed) - mean(pre_metagenomics_analyzed)

sd(pre_metagenomics_analyzed)
sd(post_metagenomics_analyzed)

t.test(post_metagenomics_analyzed, pre_metagenomics_analyzed)

cohen.d(post_metagenomics_analyzed, pre_metagenomics_analyzed)

metagenomics_analyzed <- data.frame(Pre_VTP = pre_metagenomics_analyzed, Post_VTP = post_metagenomics_analyzed)
ggpaired(metagenomics_analyzed, cond1 = "Pre_VTP", cond2 = "Post_VTP",
         color = "condition", fill = NULL, line.color = "black", palette = 'ucscgb', title = "I understand how metagenomics data are analyzed", xlab = "", ylab = "Student Response")

A Welch two-sample t-test (the default for R’s t.test) showed that, as hypothesized, participants reported lower levels of metagenomics analysis knowledge before completing the intervention (M = 0.400, SD = 0.598) than after (M = 3.65, SD = 1.27), M_diff = 3.25, t = 10.4, p < .0001. The effect size is very large (Cohen’s d = 3.28).

d. Linux/Bash terminal

#I understand how to use the Linux/Bash terminal    

pre_linux_bash <- pre$I.understand.how.to.use.the.Linux_Bash.terminal
post_linux_bash <- post$I.understand.how.to.use.the.Linux_Bash.terminal

summary(pre_linux_bash)
summary(post_linux_bash)

mean(post_linux_bash) - mean(pre_linux_bash)

sd(pre_linux_bash)
sd(post_linux_bash)

t.test(post_linux_bash, pre_linux_bash)

cohen.d(post_linux_bash, pre_linux_bash)


linux_bash <- data.frame(Pre_VTP = pre_linux_bash, Post_VTP = post_linux_bash)
ggpaired(linux_bash, cond1 = "Pre_VTP", cond2 = "Post_VTP",
         color = "condition", fill = NULL, line.color = "black", palette = 'ucscgb', title = "I understand how to use the Linux/Bash terminal", xlab = "", ylab = "Student Response")

A Welch two-sample t-test (the default for R’s t.test) showed that, as hypothesized, participants reported lower levels of Linux/Bash terminal knowledge before completing the intervention (M = 0.450, SD = 1.28) than after (M = 3.65, SD = 1.42), M_diff = 3.20, t = 7.48, p < .0001. The effect size is very large (Cohen’s d = 2.37).

e. R/RStudio

#I understand how to use R programming language and R_RStudio

pre_r <- pre$I.understand.how.to.use.R_RStudio
post_r <- post$I.understand.how.to.use.R_RStudio

summary(pre_r)
summary(post_r)

mean(post_r) - mean(pre_r)

sd(pre_r)
sd(post_r)

t.test(post_r, pre_r)

cohen.d(post_r, pre_r)

r <- data.frame(Pre_VTP = pre_r, Post_VTP = post_r)
ggpaired(r, cond1 = "Pre_VTP", cond2 = "Post_VTP",
         color = "condition", fill = NULL, line.color = "black", palette = 'ucscgb', title = "I understand how to use R/RStudio", xlab = "", ylab = "Student Response")

A Welch two-sample t-test (the default for R’s t.test) showed that, as hypothesized, participants reported lower levels of R/RStudio knowledge before completing the intervention (M = 0.350, SD = 0.812) than after (M = 3.80, SD = 1.24), M_diff = 3.45, t = 10.4, p < .0001. The effect size is very large (Cohen’s d = 3.29).

f. Bioinformatics applications

#I understand how bioinformatics tools can be used to answer a scientific question

pre_scientific_question <- pre$I.understand.how.bioinformatics.tools.can.be.used.to.answer.a.scientific.question
post_scientific_question <- post$I.understand.how.bioinformatics.tools.can.be.used.to.answer.a.scientific.question

summary(pre_scientific_question)
summary(post_scientific_question)

mean(post_scientific_question) - mean(pre_scientific_question)

sd(pre_scientific_question)
sd(post_scientific_question)

t.test(post_scientific_question, pre_scientific_question)

cohen.d(post_scientific_question, pre_scientific_question)

scientific_question <- data.frame(Pre_VTP = pre_scientific_question, Post_VTP = post_scientific_question)
ggpaired(scientific_question, cond1 = "Pre_VTP", cond2 = "Post_VTP",
         color = "condition", fill = NULL, line.color = "black", palette = 'ucscgb', title = "I understand how bioinformatics tools can help answer a scientific question", xlab = "", ylab = "Student Response")

A Welch two-sample t-test (the default for R’s t.test) showed that, as hypothesized, participants reported lower levels of bioinformatics application knowledge before completing the intervention (M = 1.25, SD = 1.25) than after (M = 3.95, SD = 1.19), M_diff = 2.7, t = 6.99, p < .0001. The effect size is very large (Cohen’s d = 2.21).
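The six analyses above repeat the same steps for each survey item, so they could be wrapped in one helper. The following is a minimal sketch in base R, not the code used for the report: `summarize_item`, the demo data frames, and their column names are hypothetical, and Cohen's d is computed by hand with a pooled SD under the assumption of equal group sizes.

```r
# Summarize one survey item: means, SDs, mean difference, Welch t-test,
# and Cohen's d with a pooled SD (equal group sizes assumed).
summarize_item <- function(pre_scores, post_scores) {
  stopifnot(length(pre_scores) == length(post_scores))
  test <- t.test(post_scores, pre_scores)  # default: Welch, unpaired
  pooled_sd <- sqrt((sd(pre_scores)^2 + sd(post_scores)^2) / 2)
  data.frame(
    mean_pre  = mean(pre_scores),
    sd_pre    = sd(pre_scores),
    mean_post = mean(post_scores),
    sd_post   = sd(post_scores),
    mean_diff = mean(post_scores) - mean(pre_scores),
    t         = unname(test$statistic),
    p_value   = test$p.value,
    cohens_d  = (mean(post_scores) - mean(pre_scores)) / pooled_sd
  )
}

# Hypothetical survey data (not study data); column names are placeholders.
pre_demo  <- data.frame(item_a = c(0, 1, 0, 2), item_b = c(1, 0, 0, 1))
post_demo <- data.frame(item_a = c(3, 4, 3, 5), item_b = c(4, 3, 2, 4))

# Apply the helper to every item and stack the rows into one table.
results <- do.call(rbind, lapply(names(pre_demo), function(item) {
  cbind(item = item, summarize_item(pre_demo[[item]], post_demo[[item]]))
}))
print(results)
```

Collecting all items in one table makes it easier to keep the reported means, SDs, test statistics, and effect sizes consistent with each other.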

(5) Student Feedback

Overview

Participants had overwhelmingly positive experiences:

All participants reported they would recommend the VTPs to other students.
A majority (> 60% of the students) requested information about our 2-3 week extension VTPs, if/when funds become available. VTP Extension Projects allow participants to dive deeper into a subject and its analysis methods after completing the initial VTP. In Extension Projects, participants complete a customized and independent research project with that VTP’s dataset and present a capstone to an established domain expert who can assess the work and provide an evaluation that can be used for a recommendation letter if desired.
Many (> 15% of the students) have served as Assistant Mentors (AMs) to new cohorts of the VTPs they previously completed, and in multiple cases, helped participants who had considerably more education/experience than themselves. The AM role is similar to that of a TA. We ask AMs for a 4-hour time commitment throughout the VTP, and they help new participants address the issues they encountered during their own cohort (AMs are not expected to be experts or to address questions they are not comfortable answering). Assistant Mentoring helps participants gain a deeper understanding of the material and interact with more researchers from academia and industry. MILRD pays an honorarium to each AM for their efforts.

Scholarship Spotlight: Naomi Zakimi

Naomi Zakimi was a Genetics and Genomics major at UC Davis when she completed the Metagenomics + Microbial Surveillance VTP. She completed her degree shortly after completing the VTP and is now working at a bioinformatics lab (also at UC Davis).

Naomi later served as an Assistant Mentor to a new Metagenomics VTP cohort and additionally completed our Single-cell Transcriptomics + Lung Cell Characterization VTP.

Thank you…for this amazing opportunity! I learned so much from this VTP and found the topic very interesting

Scholarship Spotlight: Anthony Givans

Anthony completed the Variant Calling + COVID-19 VTP as a high school student in Jamaica. He is currently a first-year student at the University of Miami, double majoring in Finance and Mathematics with a minor in Computer Science.

Here are Anthony’s words about his experience in the MILRD VTP program:

This [VTP] program was so beneficial to my development. As a prospective Data Scientist, I got the opportunity to work in an RStudio environment and get more acquainted with R. It has definitely helped me to solidify my knowledge of the programming language. One of my favourite parts was interacting with the mentors, along with the other students doing the program with me. It was a unique opportunity to interact with persons more knowledgeable than me without feeling left out… The VTP gave me unprecedented access to research materials and undoubtedly developed my data and computational skills. I would recommend this VTP to anyone who is interested in a data related field.

…I was surrounded (virtually) with extremely smart individuals, some with PhDs, who made me feel welcome and excited to tackle the problems at hand. It was by no means easy, but by collaborating with the group and working together to fix bugs, we were successful in the end…. As a recent self-taught programmer at the time, I had never worked with such a large code base. That said, I was able to learn a few tips and tricks on how to structure and write clean and effective code, which has been helping me ever since.

It has helped me so much that I have gone on to do many exciting things with code and I even decided to tack on a Computer Science minor to my already strenuous course load…. All in all, my participation in the program helped me a lot and I am very grateful.

(6) Upcoming Scholarship Cohorts

MILRD has secured course integrations and scholarship students in collaboration with MindsOf and Steppingstone Scholars.

MindsOf Initiative is a mentorship program that supports aspiring professionals of Caribbean heritage across a broad range of fields. This initiative provides career guidance and training opportunities to high schoolers, undergraduates, and young professionals. MindsOf and MILRD collaborate to provide cost-free VTPs to underrepresented minority students.

Steppingstone Scholars is an educational social mobility non-profit organization. For low-income students in the City of Philadelphia, there are often no clear pathways to college or the workforce. Since 1999, Steppingstone has been working to address this systemic problem by creating not just one pathway, but many. Steppingstone Ventures programming reaches over 1000 students per year, preparing students for their futures while serving as an innovation hub for better ways to support students with a focus on STEM enrichment and university partnerships.

(7) Modalities of Volunteering

ILMN staff can volunteer to support MILRD Scholarship VTPs via two modalities. Each is a 1-hour time commitment.

Modality 1: Asynchronous cohort group-channel feedback

VTP Mentors will:

@-mention the ILMN volunteer in the group Slack channel to ask them to add supporting context
Not ask the ILMN volunteer to troubleshoot code or to do any time-sensitive task
Expect a response within 36 hrs

Modality 1 Example:

Modality 2: Post-VTP Career Talk

ILMN volunteer will give a 30-60 minute talk about their career path and how they use the methods and data covered in the VTP in their everyday roles

MILRD Bifx Virtual Training Project Program Overview and Volunteer opportunities for Illumina Staff

Paul Scheid | [email protected]

2022-06-07

(1) Executive Summary

(2) VTP Overview

Case Study Example: Metagenomics + Microbial Surveillance VTP

(3) Past Scholarship Participant Breakdown

Mentors

Conclusions

(4) Impact Assessment

Overview

Variables and Measures

Results

a. Genomics data format

b. Metagenomics data collection

c. Metagenomics analysis

d. Linux/Bash terminal

e. R/RStudio

f. Bioinformatics applications

(5) Student Feedback

Overview

Scholarship Spotlight: Naomi Zakimi

Scholarship Spotlight: Anthony Givans

(6) Upcoming Scholarship Cohorts

(7) Modalities of Volunteering

(8) Sign up