Important Announcement 1

Before we start, everyone must complete the Pre-VTP survey: https://forms.gle/FA3gmijzCVf6Kiuv8

If you haven’t done this, please take 5 minutes to complete this now.

Important Announcement 2

Here is a Google Sheet that contains Sample Assignments, AWS and R addresses/info, and Participant Info:

https://docs.google.com/spreadsheets/d/1kGShY2587v0tYbqlc2G3v_vh6D5Ynu3fpEY_izYQzYY/edit?usp=sharing

Overview

  1. Introduction to the project.

  2. Rationale for the MetaSub Project

  3. Overview: bioinformatics analyses you’ll perform

1. Introduction to the project.

The primary goal of this project is to use the Linux command line (Bash) and the R programming language to characterize, quantify, and visualize the species composition of urban microbiome samples (i.e., subway swabs), from raw data to completed analysis.

What is bioinformatics?

Think, Room (4-5 ppl), Share:

  • 1-minute: think of one lab, activity, or lesson you previously taught that had an element that could have been explored using bioinformatics analysis, and be ready to explain how.

  • 3-minutes: divide into breakout rooms and discuss with your group

  • 2-minutes: re-join the main meeting room and present 1-2 examples to the other participants.

Why is the metagenomics application of bioinformatics useful?

In case you (or more likely, your students) are thinking, “How would someone be able to use this knowledge in real life?”, here are three real-world applications of the metagenomics analysis technique you will learn:

  • Research Applications: you can ask questions about the similarity of microbiome samples, the prevalence/emergence of antimicrobial-resistant strains, etc. The Metasub Project, which is where the data you’ll analyze in this VTP comes from, is an example of this application.

  • Clinical Applications: This technique is already being employed by some biotech startups, like Karius (https://kariusdx.com/), to rapidly diagnose blood-borne infections.

  • Biotech/Industry Applications: Some companies offer microbial surveillance services to identify and monitor the presence of resistant pathogens (e.g. on hospital surfaces). One such startup, Biotia (https://www.biotia.io/), was spun out of the Mason lab (the same Cornell Med lab that created the Metasub project) and utilizes many of the same techniques that you will employ in this VTP.

In a broader sense, bioinformatics is a subdivision of data science; principles learned in the former are highly relevant to the latter. We’ll point these out as we go along.

History of metagenomics:

Example: Clinical Application of Metagenomics:

Think, Room (4-5 ppl), Share:

  • 1-minute: Come up with one potential application of metagenomics not discussed here.

  • 3-minutes: divide into breakout rooms and discuss with your group

  • 2-minutes: re-join the main meeting room and present 1-2 examples from your breakout room to the other participants.

2. Rationale for the MetaSub Project

To accomplish the goals of this project, we will use data from the Metasub project, an effort to characterize the “built environment” microbiomes of mass transit systems around the world, headed by Dr. Chris Mason’s lab at Weill Cornell Medical Center (http://www.masonlab.net/).

Here’s the recent Metasub paper in case you haven’t seen it and would like to review it later: https://milrd.org/wp-content/uploads/2022/09/Cell_A-global-metagenomic-map-of-urban-microbiomesand-antimicrobial-resistance.pdf

(To be frank, this article is a bit challenging to read, so we suggest you review it later on if you’re inclined, after you’ve done a bit of the project.)

Some additional information in case you’re interested:

Metasub was born out of a project called PathoMap (also from the Mason lab), which began in summer 2013 to profile the metagenome in, around, and below New York City across mass-transit areas of the built environment, with a focus on the subway.

Here’s the Pathomap paper in case you would like to review it later: https://milrd.org/wp-content/uploads/2022/09/Geospatial-Resolution-of-Human-and-Bacterial-Diversity-with-City-Scale-Metagenomics.pdf

Pathomap sought to establish baseline profiles across the subway system, identify potential bio-threats, and provide an additional level of data that can be used by the city to create a “smart city,” i.e., one that uses high-dimensional data to improve city planning, management, and human health.

Metasub extended the Pathomap project based on the recognition that NYC is not the only city in the world that could benefit from a systematic, longitudinal metagenomic profile of its subway system.

Although the NYC subway has the most stations, it ranks 7th in the world in terms of the number of riders per year. The busiest subways of the world span a wide variety of population densities, lengths, and climate types, ranging from cold (Moscow) to temperate (New York City, Paris), to subtropical (Mexico City) and tropical (São Paulo).

To address this gap in our knowledge of the built environment, the Mason lab created Metasub: an international consortium of laboratories to establish a world-wide “DNA map” of microbiomes in mass transit systems.

Take a look at Figure 1, which provides an overview of the Pathomap project’s design and execution:

As you can see, the researchers:

  1. collected samples from New York City’s five boroughs

  2. collected samples from the 466 subway stations of NYC across the 24 subway lines

  3. extracted, sequenced, QC’d, and analyzed DNA

  4. mapped the distribution of taxa identified from the entire pooled dataset, and

  5. presented geospatial analysis of a highly prominent genus, Pseudomonas

Notably, as seen in (D), nearly half of the DNA sequenced (48%) did not match any known organism, underscoring the vast wealth of likely unknown organisms that surround passengers every day.

Think, Room (4-5 ppl), Share:

  • 1-minute: Why do you think so much of the sequenced DNA did not match any known organism?

  • 3-minutes: divide into breakout rooms and discuss with your group

  • 2-minutes: re-join the main meeting room and present 1-2 examples from your breakout room to the other participants.

Now, let’s focus on Fig 1C, as samples from the Metasub project were collected, sequenced, and analyzed in a similar manner to the Pathomap project. Here’s a simplified version of what the Mason lab did in Fig 1C:

3. Overview: bioinformatics analyses you’ll perform

This guide is intended to teach you one component of metagenomic analysis: how to plot abundances at the “phylum” level for each metagenomics sample.

Throughout the VTP, each participant characterizes, quantifies, and visualizes microbial metagenomics data from sequenced swabs of public urban environments on their own AWS high-performance compute instance. In the Linux terminal, participants perform genomic data quality control, genome alignment, and taxonomic characterization and abundance quantification; in R, they visualize the results and conduct a principal component analysis. To conclude, they investigate their most abundant species and use the PATRIC database to consider how they would determine the strains of these species.
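As a preview of the principal component analysis step, here is a minimal sketch using base R’s prcomp function on a small made-up abundance matrix. The sample names, taxon names, and numbers are all invented for illustration; the VTP’s own PCA code and data may differ.

```r
# A made-up abundance matrix: rows are samples, columns are taxa.
# All names and values are hypothetical.
abund = matrix(
  c(50, 30, 20,
    45, 35, 20,
    10, 60, 30,
    15, 55, 30),
  nrow = 4, byrow = TRUE,
  dimnames = list(c('s1', 's2', 's3', 's4'),
                  c('taxon_1', 'taxon_2', 'taxon_3'))
)

# Run the PCA; centering subtracts each taxon's mean abundance first
pca = prcomp(abund, center = TRUE, scale. = FALSE)

# The "scores" are each sample's coordinates on the principal components
scores = pca$x
print(scores[, 1:2])
```

Samples with similar profiles (s1 and s2, or s3 and s4) land on the same side of PC1, which is the kind of grouping a PCA plot makes visible.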

Think, Room (4-5 ppl), Share:

  • 1-minute: Note that we’re using two programming languages (Linux/Bash and R). Why do you think it’s necessary to use two programming languages, and not just one?

  • 3-minutes: divide into breakout rooms and discuss with your group

  • 2-minutes: re-join the main meeting room and present 1-2 examples from your breakout room to the other participants.

All bioinformatics tasks will be performed in “the cloud” on an Amazon Web Services (AWS) hosted high performance compute instance.

What is cloud computing:

Think, Room (4-5 ppl), Share:

  • 1-minute: Why do we (as well as professional researchers, data scientists, bioinformaticians) often need to perform analyses on the cloud?

  • 3-minutes: divide into breakout rooms and discuss with your group

  • 2-minutes: re-join the main meeting room and present 1-2 examples from your breakout room to the other participants.

What are R and RStudio?

RStudio is an integrated development environment (IDE) for the open-source R programming language, which basically means it brings the core components of R (Scripting Pane, Console, Environment Pane, File Manager/Plots/Help Pane) into a quadrant-based user interface for efficient use.

Here is what the RStudio dashboard looks like:

In R, code is always executed via the Console, but you can choose whether to write that code in a Script (opened in the Scripting Pane) or type it directly into the Console. Unless you need to quickly execute a one-liner (e.g. setting a working directory using the setwd command), you’ll want to write your code in a script, highlight it, and click Run to execute it in the Console. This is because you can easily edit code in a script and re-run it. Once code is run in the Console, it is not editable.
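As a small illustration of the kind of one-liner you might run directly in the Console, here is a sketch of checking and setting the working directory. It uses R’s temporary directory only so that the example runs anywhere; in the VTP you would substitute your own project folder.

```r
# Show the current working directory
original_dir = getwd()
print(original_dir)

# Set the working directory; tempdir() is a stand-in for your own folder
setwd(tempdir())

# Confirm that the change took effect
new_dir = getwd()
print(new_dir)

# Return to the directory we started in
setwd(original_dir)
```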

(1) Login to your AWS-hosted RStudio Instance (URL/username in the Google Sheet)

An instance of RStudio, hosted on AWS, is already set up and waiting for you. Log in to your AWS-hosted RStudio instance using the URL, username, and password assigned to you in the Google Sheet.

Please make sure the RStudio user interface dashboard looks as follows: Console Pane (Lower Left Quadrant), Scripting Pane (Upper Left Quadrant), File Manager/Plots Display/Help-Tab Pane (Lower Right Quadrant), Environment/History-Tab Pane (Upper Right Quadrant).

(2) Generate barplot(s) using subset of kraken samples

In this section, you will learn how to use R to generate two plots that help visualize the similarities and differences between a subset of the Metasub samples and compare it to metagenomics samples from the Human Microbiome project.

We have provided a subset of metagenomics data (taxa_table.csv) for your convenience. This file contains taxonomically characterized and quantified metagenomics data from nine microbiome samples: three from the Human Microbiome Project, three from the Metasub project, and three from a mystery source.

First, we will plot our samples as a stacked barplot at the phylum level. A stacked barplot shows a set of numbers as segments stacked one on top of another in a single column, each colored by its label. In our case, each sample is one column, and its taxonomic profile is a set of numerical abundances, each labeled by the taxon (here, the phylum) it belongs to.
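To make this concrete, here is a minimal sketch of the table layout a stacked barplot works from. The column names (sample, taxon, percent_abundance) mirror the ones used in the plotting code; the sample names and numbers are invented for illustration and are not taken from taxa_table.csv.

```r
# A toy taxonomic profile in "long" format: one row per sample/taxon pair.
# All values below are made up for illustration.
toy = data.frame(
  sample = c('sample_A', 'sample_A', 'sample_A',
             'sample_B', 'sample_B', 'sample_B'),
  taxon = c('Proteobacteria', 'Firmicutes', 'Actinobacteria',
            'Proteobacteria', 'Firmicutes', 'Actinobacteria'),
  percent_abundance = c(50, 30, 20, 10, 60, 30)
)

# Each sample becomes one column in the plot; its height is the sum of its rows
column_heights = tapply(toy$percent_abundance, toy$sample, sum)
print(column_heights)
```

Each row contributes one colored segment, so a sample with three phyla is drawn as a column of three stacked segments.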

Here’s how we make this plot in R:

library(ggplot2)   # This line loads the ggplot2 plotting library into our environment

taxa = read.csv('taxa_table.csv', header=TRUE, sep=',')  # Read our taxonomic table into a computational object from a file

taxa = taxa[taxa$rank == 'phylum',]  # This filters our taxonomic table to a specific taxonomic rank. One of Kingdom, Phylum, Class, Order, Family, Genus, Species. Play around with a few other ranks.

ggplot(taxa, aes(x=sample, y=percent_abundance, fill=taxon)) + # this creates a ggplot object. Can you figure out what the aes(...) section is doing?
  geom_bar(stat="identity") +  # this tells ggplot how we want our data to be displayed
  xlab('Sample') +  # These lines tell ggplot what our axis labels should be
  ylab('Abundance') +
  labs(fill='Phylum') +
  theme(axis.text.x = element_text(angle = 90)) # this rotates the x-axis text 90 degrees

Here is a video of this step being performed:

(3) Clear your Console, Environment Tab, and Plots tab using the brush button and re-run the script chunk-by-chunk (noting what each chunk appears to be doing):

Chunk 1:

library(ggplot2)   # This line loads the ggplot2 plotting library into our environment

Chunk 1+2:

library(ggplot2)   # This line loads the ggplot2 plotting library into our environment

taxa = read.csv('taxa_table.csv', header=TRUE, sep=',')  # Read our taxonomic table into a computational object from a file

Chunk 1+2+3:

library(ggplot2)   # This line loads the ggplot2 plotting library into our environment

taxa = read.csv('taxa_table.csv', header=TRUE, sep=',')  # Read our taxonomic table into a computational object from a file

taxa = taxa[taxa$rank == 'phylum',]  # This filters our taxonomic table to a specific taxonomic rank. One of Kingdom, Phylum, Class, Order, Family, Genus, Species. Play around with a few other ranks.

Chunk 1+2+3+4:

library(ggplot2)   # This line loads the ggplot2 plotting library into our environment

taxa = read.csv('taxa_table.csv', header=TRUE, sep=',')  # Read our taxonomic table into a computational object from a file

taxa = taxa[taxa$rank == 'phylum',]  # This filters our taxonomic table to a specific taxonomic rank. One of Kingdom, Phylum, Class, Order, Family, Genus, Species. Play around with a few other ranks.

ggplot(taxa, aes(x=sample, y=percent_abundance, fill=taxon)) + # this creates a ggplot object. Can you figure out what the aes(...) section is doing?
  geom_bar(stat="identity") +  # this tells ggplot how we want our data to be displayed
  xlab('Sample') +  # These lines tell ggplot what our axis labels should be
  ylab('Abundance') +
  labs(fill='Phylum') +
  theme(axis.text.x = element_text(angle = 90)) # this rotates the x-axis text 90 degrees

(4) Generate barplot at the family level

  1. Modify code: edit your script to generate a barplot at the “family” level, by modifying this line of code:
taxa = taxa[taxa$rank == 'phylum',] 
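Before changing the rank, it can help to check which rank labels the table actually contains, since the filter only matches exact spellings. Here is a minimal sketch, using a small made-up table in place of taxa_table.csv (the rows below are hypothetical):

```r
# A made-up stand-in for the full table as read from the file (before filtering)
toy = data.frame(
  taxon = c('Proteobacteria', 'Firmicutes', 'Pseudomonadaceae', 'Pseudomonas'),
  rank = c('phylum', 'phylum', 'family', 'genus')
)

# List the distinct rank labels present in the table
available_ranks = unique(toy$rank)
print(available_ranks)

# Count how many rows carry each rank label
rank_counts = table(toy$rank)
print(rank_counts)
```

Note that a check like this needs the unfiltered table. Once the filter line has run, taxa holds only the rows for one rank, so re-run the read.csv line before filtering to a different rank.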

CP Template:

Checkpoint Example:

Checkpoint (CP) 1: In a single message in your assigned Slack channel, please:

  1. Upload your barplot at the phylum level
  2. Upload your barplot at the family level
  3. Answer these questions:
  • Which line of code ingests data from the taxa_table.csv file into the taxa object (copy paste the exact line of code)?

  • Which line of code allows you to filter down to a specific taxonomic level?

  • What is the $ operator doing in this line of code: taxa = taxa[taxa$rank == 'phylum',] ?

  • Why wouldn’t you want to plot a barplot at the species level? (If you’re not sure, try it!)

  • What can the barplots tell you about the similarities and dissimilarities of microbiome samples?

  • What questions do barplots leave unanswered about microbiome samples?

(5) Access your Linux terminal via your AWS-hosted RStudio instance

Access your Linux terminal by logging into RStudio via the web browser (the URL can be found in the Google Sheet).

Set up your Linux environment:

It should look like this once you’re finished setting up your Linux environment: