Introduction

VTP Goal

The primary goal of this Virtual Training Project (VTP) is to use the Linux and R programming languages to bioinformatically characterize, quantify and visualize the species composition of urban microbiome samples (i.e. subway swabs) from raw data to completed analysis.

What is bioinformatics

Why is the metagenomics application of bioinformatics useful?

In case you are thinking, "How would someone be able to use this knowledge in real life?", here are three real-world applications of the metagenomics analysis technique you will learn:

In a broader sense, bioinformatics is a subdivision of data science; principles learned in the former are highly relevant to the latter. We'll point these out as we go along.

Data Source

The source of the data you will analyze in this VTP is from the MetaSUB Project, an effort to characterize the "built environment" microbiomes of mass transit systems around the world, headed by Dr. Chris Mason's Lab at Weill Cornell Medical Center.

What is MetaSUB

MetaSUB an international consortium of laboratories created to establish a world-wide "DNA map" of microbiomes in mass transit systems. It is succeeded a previous study

MetaSUB consists of a group of labs from all over the world working together to study the microbes in mass transit systems like bacteria, viruses, and archaea (i.e. their microbiomes) and make a "DNA map" of sequences from them. The project was started by the Mason lab and is an extension of a previous study called Pathomap. Pathomap only looked at the NYC subway system and aimed to create a profile of what's normal, identify potential bio-threats, and use the data to make the city better for people's health. MetaSUB recognized that other cities around the world could benefit from a similar study and decided to expand the project to include subway systems worldwide.

How do MetaSUB Researchers Collect, Process, and Analyze Samples

Take a look at this figure, which provides an overview of MetaSUB's design and execution:

As you can see, the researchers collected samples from

(A) New York City's five boroughs

(B) Collected samples from the 466 subway stations of NYC across the 24 subway lines

(C) extracted, sequenced, QC'd and analyzed DNA

(D) Mapped the distribution of taxa identified from the entire pooled dataset, and

(E) presented geospatial analysis of a highly prominent genus, Pseudomonas

Notably, as seen in (D), nearly half of the DNA sequenced (48%) did not match any known organism, underscoring the vast wealth of likely unknown organisms that surround passengers every day.

Now, let's focus on Fig 1C, this is shows at a high level the steps to process and analyze samples. We'll be doing a very similar analysis in this VTP but not identical.

Here's a simplified version of sample processing and sequencing:

>

Alright, before moving onto the next section, here are some more resources in case you want to review them later:

Overview: Bioinformatics Analyses

Summary

Throughout this VTP, you will characterize, quantify and visualize microbial metagenomics data from sequenced swabs of public urban environments on your own cloud-hosted High Performance Compute Instance.

In the Linux terminal, you will learn the structure of raw genomic data and perform taxonomic characterization & abundance quantification. In R, you will determine the mose abundant species in your sample and visualize results at several taxonomic levels.

Here's a flowchart of the analysis you'll perform:

%% Sequence diagram code flowchart LR id1(Look:<br />assigned dataset)-->id2(Metagenomics<br />Analysis); id2-->id3(Identify:<br />Abundant Species); id3-->id4(Barplots); style id1 fill:#ddd, stroke:#524,stroke-width:4px; style id2 fill:#ddd, stroke:#524,stroke-width:4px; style id3 fill:#7be, stroke:#524,stroke-width:4px; style id4 fill:#7be, stroke:#524,stroke-width:4px;
Linux Step R Step Current Step

Cloud Computing

All bioinformatics tasks will be performed in "the cloud" on your own Amazon Web Services (AWS) hosted high performance compute instance.

What is cloud computing:

What is R and RStudio

For this project, you will execute all Linux and R tasks in RStudio.

RStudio is an integrated development environment (IDE) for the open-source R programming language, which basically means it brings the core components of R (Scripting Pane, Console, Environment Pane, File Manager/Plots/Help Pane) into a quadrant-based user interface for efficient use.

RStudio also comes with a Linux terminal built-in, where you can execute linux commands.

Access your Linux terminal by logging into RStudio via the web browser (can be found in the the Getting Started section.

Here is what the RStudio dashboard looks like:

Bioinformatics Tasks

Analysis Approach