Introduction

VTP Goal

The primary goal of this Virtual Training Project (VTP) is to use the Linux and R programming languages to bioinformatically characterize, quantify and visualize the species composition of urban microbiome samples (i.e. subway swabs) from raw data to completed analysis.

What is bioinformatics

Why is the metagenomics application of bioinformatics useful?

In case you are thinking, "How would someone be able to use this knowledge in real life?", here are three real-world applications of the metagenomics analysis technique you will learn:

Research Applications: you can ask questions about the similarity of microbiome samples, the prevalence/emergence of antimicrobial resistant strains, etc. The MetaSUB Project, which is where the data you'll analyze in this VTP comes from, would be an example of this application.
Clinical Applications: This technique is already being employed by some biotech startups, like Karius, to rapidly diagnose blood-borne infections.
Biotech/Industry Applications: Some companies offer microbial surveillance services to identify and monitor the presence of resistant pathogens (e.g. on hospital surfaces). One such startup called Biotia was spun out of the Mason lab (the same Cornell Med lab that created the MetaSUB project) and utilizes many of the same techniques that you will employ in this VTP."

In a broader sense, bioinformatics is a subdivision of data science; principles learned in the former are highly relevant to the latter. We'll point these out as we go along.

Data Source

The source of the data you will analyze in this VTP is from the MetaSUB Project, an effort to characterize the "built environment" microbiomes of mass transit systems around the world, headed by Dr. Chris Mason's Lab at Weill Cornell Medical Center.

What is MetaSUB

MetaSUB an international consortium of laboratories created to establish a world-wide "DNA map" of microbiomes in mass transit systems. It is succeeded a previous study

MetaSUB consists of a group of labs from all over the world working together to study the microbes in mass transit systems like bacteria, viruses, and archaea (i.e. their microbiomes) and make a "DNA map" of sequences from them. The project was started by the Mason lab and is an extension of a previous study called Pathomap. Pathomap only looked at the NYC subway system and aimed to create a profile of what's normal, identify potential bio-threats, and use the data to make the city better for people's health. MetaSUB recognized that other cities around the world could benefit from a similar study and decided to expand the project to include subway systems worldwide.

How do MetaSUB Researchers Collect, Process, and Analyze Samples

Take a look at this figure, which provides an overview of MetaSUB's design and execution:

As you can see, the researchers collected samples from

(A) New York City's five boroughs

(B) Collected samples from the 466 subway stations of NYC across the 24 subway lines

(D) Mapped the distribution of taxa identified from the entire pooled dataset, and

(E) presented geospatial analysis of a highly prominent genus, Pseudomonas

Notably, as seen in (D), nearly half of the DNA sequenced (48%) did not match any known organism, underscoring the vast wealth of likely unknown organisms that surround passengers every day.

Now, let's focus on Fig 1C, this is shows at a high level the steps to process and analyze samples. We'll be doing a very similar analysis in this VTP but not identical.

Here's a simplified version of sample processing and sequencing:

Alright, before moving onto the next section, here are some more resources in case you want to review them later:

Here's the MetaSUB paper in case you would like to review it later. To be frank, this article is a bit challenging to read, so we suggest you review it later on if you're inclined, after you've done a bit of the VTP.
New York Times article about the MetaSUB Project.
Here's the Pathomap paper, the predecessor study that only investigated the NYC Subway Microbiome.

Overview: Bioinformatics Analyses

Summary

Throughout this VTP, you will characterize, quantify and visualize microbial metagenomics data from sequenced swabs of public urban environments on your own cloud-hosted High Performance Compute Instance.

In the Linux terminal, you will learn the structure of raw genomic data and perform taxonomic characterization & abundance quantification. In R, you will determine the mose abundant species in your sample and visualize results at several taxonomic levels.

Here's a flowchart of the analysis you'll perform:

%% Sequence diagram code flowchart LR id1(Look:<br />assigned dataset)-->id2(Metagenomics<br />Analysis); id2-->id3(Identify:<br />Abundant Species); id3-->id4(Barplots); style id1 fill:#ddd, stroke:#524,stroke-width:4px; style id2 fill:#ddd, stroke:#524,stroke-width:4px; style id3 fill:#7be, stroke:#524,stroke-width:4px; style id4 fill:#7be, stroke:#524,stroke-width:4px;

Linux Step R Step Current Step

Cloud Computing

All bioinformatics tasks will be performed in "the cloud" on your own Amazon Web Services (AWS) hosted high performance compute instance.

What is cloud computing:

What is R and RStudio

For this project, you will execute all Linux and R tasks in RStudio.

RStudio is an integrated development environment (IDE) for the open-source R programming language, which basically means it brings the core components of R (Scripting Pane, Console, Environment Pane, File Manager/Plots/Help Pane) into a quadrant-based user interface for efficient use.

RStudio also comes with a Linux terminal built-in, where you can execute linux commands.

Access your Linux terminal by logging into RStudio via the web browser (can be found in the the Getting Started section.

Here is what the RStudio dashboard looks like:

Bioinformatics Tasks

Analysis Approach

See One, Do One, Repeat: In each step of this analysis, you will first see the analysis performed with Example Sample [sname_1]([sid_1]). You will be asked to reproduce the step with that same sample and confirm you get the same results. You will then be asked to execute the step with your own assigned sample, which is found in the Getting Started section.
Do not get hung up on coding syntax: our goal is not to teach you a programming language or languages from scratch. Our goal is for you to understand the inputs, parameters, outputs, and how to interpret the output. Some participants find it helpful to think of each step as a mathematical function (e.g. y = mx + b): regardless of how complicated a block of code or command is, you are always putting data in and getting data out.