⚠️ Please complete the Pre-VTP Survey before you start the project

1 Getting Started

⚠️ If haven’t completed the Pre-VTP Survey please do so before you start the project

1.1 RStudio, Assigned Dataset, Checkpoint Information

Here is a Google Sheet that contains each participant:

AWS-hosted RStudio URL, Username, and Password (General Info tab)
Sample Assignment Information (General Info tab)
Worksheet where you will upload your Checkpoint (CP) Results (Checkpoint (CP) Results tab)

1.2 Slack

Please koin the Slack channel for this course

1.3 Troubleshooting

When you have trouble, you can:

Double check your syntax
Slack URL

2 Introduction to the project.

The primary goal of this project is to use the Linux and R programming languages to bioinformatically characterize, quantify and visualize the species composition of urban microbiome samples (i.e. subway swabs) from raw data to completed analysis.

What is bioinformatics?

Think, Room (4-5 ppl), Share:

1-minute: think about one lab, activity, or lesson you prevously taught that had an element that could have been explored using bioinformatics analysis? (Explain)
3-minutes: divide into breakout rooms and discuss with your group
2-minutes: re-join the main meeting room and present 1-2 examples to the other participants.

Why is the metagenomics application of bioinformatics useful?

In case you (or more likely, your students) are thinking, “How would someone be able to use this knowledge in real life?”, here are three real-world applications of the metagenomics analysis technique you will learn:

Research Applications: you can ask questions about the similarity of microbiome samples, the prevalence/emergence of antimicrobial resistant strains, etc. The Metasub Project, which is where the data you’ll analyze in this VTP comes from, would be an example of this application.
Clinical Applications: This technique is already being employed by some biotech startups, like Karius (https://kariusdx.com/), to rapidly diagnose blood-borne infections.
Biotech/Industry Applications: Some companies offer microbial surveillance services to identify and monitor the presence of resistant pathogens (e.g. on hospital surfaces). One such startup called Biotia (https://www.biotia.io/) was spun out of the Mason lab (the same Cornell Med lab that created the Metasub project) and utilizes many of the same techniques that you will employ in this VTP.”

In a broader sense, bioinformatics is a subdivision of data science; principles learned in the former are highly relevant to the latter. We’ll point these out as we go along.

History of metagenomics:

Example: Clinical Application of Metagenomics:

Think, Room (4-5 ppl), Share:

1-minute: Come up with one potential application of metagenomics not discussed here.
3-minutes: divide into breakout rooms and discuss with your group
2-minutes: re-join the main meeting room and present 1-2 examples from your breakout room to the other participants.

3 Rationale for the MetaSub Project

To accomplish this goals in this project, we will use data from the Metasub project, an effort to characterize the “built environment” microbiomes of mass transit systems around the world, headed by Dr. Chris Mason’s lab at Weill Cornell Medical Center (http://www.masonlab.net/).

Here’s the recent Metasub paper in case you haven’t seen it and would like to review it later: https://milrd.org/wp-content/uploads/2022/09/Cell_A-global-metagenomic-map-of-urban-microbiomesand-antimicrobial-resistance.pdf

(To be frank, this article is a bit challenging to read, so we suggest you review it later on if you’re inclined, after you’ve done a bit of the project.)

Some additional information in case you’re interested:

New York Times article about the Metasub Project: https://www.nytimes.com/2021/05/26/science/microbes-subway-metasub-mason.html?smid=url-share
Metasub Project website: http://metasub.org

Metasub was borne out of a project called PathoMap (also from the Mason lab), which began in summer 2013 to profile the New York City metagenome in, around, and below NYC on mass-transit areas of the built environment, focusing on the subway.

Here’s the Pathomap paper in case you would like to review it later: https://milrd.org/wp-content/uploads/2022/09/Geospatial-Resolution-of-Human-and-Bacterial-Diversity-with-City-Scale-Metagenomics.pdf

Pathomap sought to establish baseline profiles across the subway system, identify potential bio-threats, and provide an additional level of data that can be used by the city to create a “smart city;” i.e., one that uses high- dimensional data to improve city planning, management, and human health.”

Metasub extended the Pathomap project based on the recognition that NYC is not the only city in the world that could benefit from a systematic, longitudinal metagenomic profile of its subway system.

Although NYC subway has the most stations, it ranks 7th in the world in term of the number of riders per year. A wide variety of population density, length, and climate types define the busiest subways of the world, ranging from cold (Moscow) to temperate (New York City, Paris), to subtropical (Mexico City) and tropical (São Paulo).

To address this gap in our knowledge of the built environment, the Mason lab created Metasub: an international consortium of laboratories to establish a world-wide “DNA map” of microbiomes in mass transit systems.

Take a look at Figure 1, which provides an overview of the Pathomap project’s design and execution:

As you can see, the researchers collected samples from

New York City’s five boroughs
Collected samples from the 466 subway stations of NYC across the 24 subway lines
extracted, sequenced, QC’d and analyzed DNA
Mapped the distribution of taxa identified from the entire pooled dataset, and
presented geospatial analysis of a highly prominent genus, Pseudomonas

Notably, as seen in (D), nearly half of the DNA sequenced (48%) did not match any known organism, underscoring the vast wealth of likely unknown organisms that surround passengers every day.

Think, Room (4-5 ppl), Share:

1-minute: Why do you think so much of the sequenced DNA did not match any known organism?
3-minutes: divide into breakout rooms and discuss with your group
2-minutes: re-join the main meeting room and present 1-2 examples from your breakout room to the other participants.

Now, let’s focus on Fig 1C, as samples from the Metasub project were collected, sequenced, and analyzed in a similar manner to the Pathomap project. Here’s a simplified version of what the Mason lab did in Fig 1C:

4 Overview: bioinformatics analyses

This guide is intended to teach you how to teach you one component of metagenomic analysis: how to plot abundances at the “phylum” level for each metagenomics sample.

Throughout the VTP, each participant characterizes, quantifies and visualize microbial metagenomics data from sequenced swabs of public urban environments on their own AWS High Performance Compute instance. In the Linux terminal, they perform genomic data quality control, genome alignment, taxonomic characterization & abundance quantification, and in R, they viaualize results, conduct a principal component analysis. To conclude, they investigate their most abundant species and use the Patric database to consider how they would determine the strains of these species.

Linux Steps R Steps

Think, Room (4-5 ppl), Share:

1-minute: Note that we’re using two programming languages (Linux/Bash and R). Why do you think it’s necessary to use two programming languages, and not just one?
3-minutes: divide into breakout rooms and discuss with your group
2-minutes: re-join the main meeting room and present 1-2 examples from your breakout room to the other participants.

All bioinformatics tasks will be performed in “the cloud” on an Amazon Web Services (AWS) hosted high performance compute instance.

What is cloud computing:

Think, Room (4-5 ppl), Share:

1-minute: Why do we (as well as professional researchers, data scientists, bioinformaticians) often need to perform analyses on the cloud?
3-minutes: divide into breakout rooms and discuss with your group
2-minutes: re-join the main meeting room and present 1-2 examples from your breakout room to the other participants.

What is R and RStusio

RStudio is an integrated development environment (IDE) for the open-source R programming language, which basically means it brings the core components of R (Scripting Pane, Console, Environment Pane, File Manager/Plots/Help Pane) into a quadrant-based user interface for efficient use.

Here is what the R dashboard looks like:

In R, code is always executed via the Console, but you have the choice whether to execute that code in a Script (opened in the Scripting Pane) or directly in the Console. Unless you need to quicly execute a one-liner (e.g. setting a working directory using the setwd command), you’ll want to be writing your code in a script, highlighting it, and clicking Run to execute it in the console. This is because you can easily edit code in a script and re-run it. Once code is run in the Console it is not editable.

5 Bioinformatics Tasks

5.1 Login to your RStudio Instance

You are provided an instance of the R editor RStudio on AWS already waiting for you to use. Login to your AWS-hosted RStudio instance using the URL, username, and password assigned to you in the Google Sheet.

Please make sure the RStudio user interface dashboard looks as follows: Console Pane (Lower Left Quadrant), Scripting Pane (Upper Left Quadrant), File Manager/Plots Display/Help-Tab Pane (Lower Right Quadrant), Environment/History-Tab Pane (Upper Right Quadrant).

5.2 Generate barplot(s) using subset of metagenomic samples

Current Step Linux Steps R Steps

In this section, you will learn how to use R to generate two plots that help visualize the similarities and differences between a subset of the Metasub samples and compare it to metagenomics samples from the Human Microbiome project.

We have provided a subset of of metagnomics data (taxa_table.csv) and one with, for your convenience. This file contains taxonomically characterized and quantified metagenomics data from nine microbiome samples: three from the human microbiome project, three from the Metasub project, and three from a mystery source.

First we will plot our samples as a stacked-barplot at the phylum level. A stacked-barplot shows a set of numbers as a series of columns one on top of each other colored by a label. In our case, a taxonomic profile is a set of numerical abundances labeled by the microbial species it belongs to.

Here’s the template code to make this plot in R:

library(ggplot2)  

taxa_yourSample = read.csv('yourSample.csv', header=TRUE, sep=',') 

phyla_yourSample = taxa_yourSample[taxa_yourSample$rank == 'phylum',] 

phyla_yourSample = phyla_yourSample[phyla_yourSample$percent_abundance >= 2,] 

ggplot(phyla_yourSample, aes(x=sample, y=percent_abundance, fill=taxon)) +
  geom_bar(stat="identity") +  
  xlab('Sample') + 
  ylab('Abundance') +
  labs(fill='Phylum') +
  theme(axis.text.x = element_text(angle = 90))

These line loads the ggplot2 library into our environment. The library() function is used to load additional libraries that contain functions to be utilized by the script that follows.
Reads tabular taxonomic data into a computational object from a .csv file using the read.csv() function.
This filters our taxonomic table to a specific taxonomic rank (i.e. Kingdom, Phylum, Class, Order, Family, Genus, Species).
This filters out and discards data for a specified variable below the specified threshold.
this creates a ggplot() object.
this tells ggplot() how we want our data to be displayed
These lines tell ggplot() what our axis labels should be
this rotates the x-axis text 90 degrees

5.2.1 Generate Barplot: See one, do one

Save space in your document using buttons or tabs for sub chapters. Add this code at the end of your title:

See One (Example)
Sample: `mini_SL342389`

This code generates a phylum-level barplot by modifying taxa_yourSample, mini_yourSample.csv, and phyla_yourSample per Example sample.

library(ggplot2)  

taxa_mini_SL342389 = read.csv('mini_SL342389_taxa.csv', header=TRUE, sep=',') 

phyla_mini_SL342389 = taxa_mini_SL342389[taxa_mini_SL342389$rank == 'phylum',] 

phyla_mini_SL342389 = phyla_mini_SL342389[phyla_mini_SL342389$percent_abundance >= 2,] 

ggplot(phyla_mini_SL342389 , aes(x=sample, y=percent_abundance, fill=taxon)) +
  geom_bar(stat="identity") +  
  xlab('Sample') + 
  ylab('Abundance') +
  labs(fill='Phylum') +
  theme(axis.text.x = element_text(angle = 90))

Execute with your sample
Sample: `yourSample`

Generate a phylum-level barplot by modifying taxa_yourSample, mini_yourSample.csv, and phyla_yourSample per your assigned sample.

library(ggplot2)  

taxa_yourSample = read.csv('yourSample_taxa.csv', header=TRUE, sep=',') 

phyla_yourSample = taxa_yourSample[taxa_yourSample$rank == 'phylum',] 

phyla_yourSample = phyla_yourSample[phyla_yourSample$percent_abundance >= 2,] 

ggplot(phyla_yourSample, aes(x=sample, y=percent_abundance, fill=taxon)) +
  geom_bar(stat="identity") +  
  xlab('Sample') + 
  ylab('Abundance') +
  labs(fill='Phylum') +
  theme(axis.text.x = element_text(angle = 90))

Execute chunk-by-chunk
Sample: `yourSample`

Clear your Console, Environment Tab, and Plots tab using the 🧹 button and re-run the script chunk-by-chunk (noting what each chunk appears to be doing):

library(ggplot2)

taxa_yourSample = read.csv('yourSample_taxa.csv', header=TRUE, sep=',')

phyla_yourSample = taxa_yourSample[taxa_yourSample$rank == 'phylum',]

phyla_yourSample = phyla_yourSample[phyla_yourSample$percent_abundance >= 2,]

ggplot(taxa, aes(x=sample, y=percent_abundance, fill=taxon)) + 
  geom_bar(stat="identity") + 
  xlab('Sample') +  
  ylab('Abundance') +
  labs(fill='Phylum') +
  theme(axis.text.x = element_text(angle = 90))

Create: family-level barplot
Sample: `yourSample`

Modify code: edit your script to generate a barplot at the “family” level, by modifying this line of code:

phyla_yourSample = taxa_yourSample[taxa_yourSample$rank == 'phylum',]

`Checkpoint (CP) 1`

Please upload the following to the Google Sheet:

Barplot at the phylum level. (To download this plot as a file, click the (+) Zoom button, right click on the plot and select “Save Image As”.)
Upload barplot at the family level. (To download this plot as a file, click the (+) Zoom button, right click on the plot and select “Save Image As”.)
Answer to these questions:

1. Which line of code ingests data from the taxa_table.csv file into the taxa object (copy paste the exact line of code)?
1. Which line of code allows you to filter down to a specific taxonomic level?
1. What is the $ operator doing in this line of code: taxa = taxa[taxa$rank == 'phylum',] ?
1. Why wouldn’t you want to plot a barplot at the species level? (If you’re not sure, try it!)
1. What can the barplots tell you about the similarities and dissimilarities of microbiome samples?
1. What questions do barlots leave unanswered about microbiome samples?

5.3 Access your Linux terminal via RStudio

Current step Linux Steps R Steps

Access your Linux terminal by logging into RStudio via the web browser (can be found in the the Google Sheet.

Setup to your Linux environment:

It should look like this once you’re finished setting up your Linux environment:

5.4 Navigating your Linux terminal.

Linux is a command-line based operating system. For our purposes Linux = Unix.

Let’s do a brief review of Linux commands on our AWS-hosted Linux instance in RStudio. This the language and structure of this tutorial is heavily borrowed from http://www.ee.surrey.ac.uk/Teaching/Unix/. Go to this website and read the “Introduction to the UNIX Operating System”. For our purposes, you can treat Unix and Linux as interchangeable.

ls (list) When you first login, your current working directory is your home directory by default. Your home directory in every Linux system is denoted by the shorthand ~.

To find out what is in your home directory, type:

$ ls

The ls command ( lowercase L and lowercase S ) lists the contents of your current working directory.

You should see these Directories and files: R databases miniconda3 pass.txt reads rstudio taxa_table.csv

Note that the directories and files are colored differently.

ls does not cause all the files in your home directory to be listed—only those ones whose name does not begin with a dot (.). Files beginning with a dot (.) are known as hidden files and usually contain important program configuration information. They are hidden because you should not change them unless you are very familiar with Linux!!!

To list all files in your home directory including those whose names begin with a dot, type:

$ ls -a

As you likely guessed, the a in -a stands for all.

As you should now see, ls -a lists files that are normally hidden.

ls is an example of a command which can take options: -a is an example of an option. The options change the behaviour of the command. There are many online resources that tell you which options a particular command can take, and how each option modifies the behaviour of the command. We’ve posted a Linux command “Cheat Sheet” in the channel with Cody.

Making Directories

mkdir (make directory)

We will now make a subdirectory in your home directory to hold the files you will be creating and using in the course of this tutorial. To make a subdirectory calledunixstuff in your current working directory type:

$ mkdir unixstuff

To see the directory you have just created, type:

$ ls

Changing to a different directory

cd (change directory)

The command cd directory means change the current working directory to ‘directory’. The current working directory may be thought of as the directory you are in, i.e. your current position in the file-system tree.

To change to the directory you have just made, type:

$ cd unixstuff

Type ls and press Enter to see the contents (which should be empty)

Make another directory inside the unixstuff directory called backups

Still in the unixstuff directory, type:

$ ls -a

As you can see, in the unixstuff directory (and in all other directories), there are two special directories called (.) and (..)

The current directory (.) In UNIX, (.) means the current directory, so typing

$ cd .

NOTE: there is a space between cd and the dot

means stay where you are (the unixstuff directory).

This may not seem very useful at first, but using (.) as the name of the current directory will save a lot of typing, as we shall see later in the tutorial.

The parent directory (..) (..) means the parent of the current directory, so typing

$ cd ..

will take you one directory up the hierarchy (back to your home directory). Try it now.

Note: typing cd with no argument always returns you to your home directory. This is very useful if you are lost in the file system.

Pathnames pwd (print working directory) Pathnames (also called filepaths) enable you to work out where you are in relation to the whole file-system. For example, to find out the absolute pathname of your home-directory, type cd to get back to your home-directory and then type:

$ pwd

The full pathname will look something like this:

/home/user#

which means that user# (your home directory) is in the sub-directory home (the group directory), which is in the top-level root directory called ” / ” .

Unix File Structure

Use the commands cd, ls and pwd to explore the file system.

More about home directories and pathnames

Understanding pathnames First type cd to get back to your home-directory, then type:

$ ls unixstuff

to list the conents of your unixstuff directory.

Now type:

$ ls backups

You will get a message like this: backups: No such file or directory

This is because backups is not in your current working directory. To use a command on a file (or directory) not in the current working directory (the directory you are currently in), you must either cd to the correct directory, or specify its full pathname. To list the contents of your backups directory, you must type:

$ ls unixstuff/backups

~ (your home directory)

Home directories can also be referred to by the tilde ~ character. It can be used to specify paths starting at your home directory. So typing

$ ls ~/unixstuff

will list the contents of your unixstuff directory, no matter where you currently are in the file system.

What do you think ls ~ lists?

At this point, you have completed through Introduction and Tutorial One on http://www.ee.surrey.ac.uk/Teaching/Unix/. If you’re interested, you can continue with Tutorial Two (Copying Files, Moving Files, Removing Files and directories, Displaying the contents of a file on the screen, Searching the contents of a file) and subsequent Tutorials 3-8, but they are outside of the scope of the Linux skills you’ll need to complete the VTP.

Once you have finished all of the above exercises, please execute:

$ history

The history command shows the recently executed commands in your linux terminal.

Side Note: you can save your Linux command history to a file using this command:

$ history > yourName_linux_work.txt

yourName_linux_work.txt is a file that should contain all of the commands you’ve executed in your Linux terminal.

`Checkpoint (CP) 2`

Please take a screenshot of the last 15 lines of your history output and upload it to the Google Sheet.

5.5 Look at your assigned dataset (time estimate: 10 min)

Current step Linux Steps R Steps

To make sure you undertstand the workflow, we’d like to first have you complete the analysis with a small dataset. We’ve taken 50,000 reads from the read1 and read2 files of each assigned sample. They are labeled with the prefix mini_ and located in the reads directory (files without the prefix mini are the original, full-size files, which each contain millions of reads.)

Data from each Metasub sample is contained in two files—read1 (i.e. R1) and read2 (i.e.R2). This is because the researchers paired-end sequenced each sample. When samples are sequenced on an Illumina genomic sequencer researchers get to decide: (1) the number of reads to generate for each sample library; (2) the read length (within a range of ~50 - ~300 nucleotides); and (3) whether to generate a single read or a pair of reads from each sequenced DNA library fragment.

Here is an diagram that summarizes paired-end versus single-end sequencing:

A FASTQ record has the following format:

A line starting with @, containing the sequence ID.
One or more lines that contain the sequence.
A new line starting with the character +, and being either empty or repeating the sequence ID.
One or more lines that contain the quality scores.

Here’s an example of a FASTQ file with two records:

@071112_SLXA-EAS1_s_7:5:1:817:345
GGGTGATGGCCGCTGCCGATGGCGTCAAATCCCACC
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IG9IC
@071112_SLXA-EAS1_s_7:5:1:801:338
GTTCAGGGATACGACGTTTGTATTTTAAGAATCTGA
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII6IBI

Navigate to the reads directory (which contains the “mini” fastq data):

$ cd ~/reads/

PLEASE NOTE: make sure to complete all steps in this section from the reads directory.

Take a look at the first 10 lines of your assigned fastq files:

$ zcat mini_my_reads.R1.fastq.gz | head

$ zcat mini_my_reads.R2.fastq.gz | head

(Please note: mini_my_reads is a placeholder for your assigned miniature reads dataset.)

5.5.1 Look at your assigned dataset: See one, do one

Save space in your document using buttons or tabs for sub chapters. Add this code at the end of your title:

See One (Example)
Sample: `mini_SL342389`

Take a look at the first 10 lines of your assigned fastq files:

$ zcat mini_SL342389.R1.fastq.gz | head

$ zcat mini_SL342389.R2.fastq.gz | head

Execute with your sample
Sample: `yourSample`

Take a look at the first 10 lines of your assigned fastq files:

$ zcat mini_my_reads.R1.fastq.gz | head

$ zcat mini_my_reads.R2.fastq.gz | head

(Please note: mini_my_reads is a placeholder for your assigned miniature reads dataset.)

`Checkpoint (CP) 3`

In the Google Sheet, please:

Upload a screenshot of your zcat results.
Answer these questions:

1. How are fastq files are structured? (i.e. What does line 1, line 2, line 3, and line 4, of each read signify?)
1. Why are there two reads files for each sample?

5.6 Remove adapters and filter-out low quality reads

Current step Linux Steps R Steps

Check AdapterRemoval is working by opening your terminal and typing:

$ AdapterRemoval -h

You should see a long list of all the arguments that can be given to this tool.

Run AdapterRemoval on your mini sample (executable version):

$ AdapterRemoval --file1 mini_my_reads.R1.fastq.gz --file2 mini_my_reads.R2.fastq.gz --trimns --trimqualities --minquality 40 --threads 2 --gzip --output1 clean.mini_my_reads.R1.fastq.gz  --output2 clean.mini_my_reads.R2.fastq.gz

Please note: mini_my_reads is a placeholder for your assigned sample ID. The file extension .fq is a widely used abbreviation for .fastq.

Run AdapterRemoval on your sample (explanatory version):

$ AdapterRemoval
    --file1 mini_my_reads.R1.fastq.gz --file2 mini_my_reads.R2.fastq.gz  
    --trimns  
    --trimqualities --minquality 40 
    --threads 2  
    --gzip  
    --output1 clean.mini_my_reads.R1.fastq.gz  --output2 clean.mini_my_reads.R2.fastq.gz

Calls the program you want to invoke.
this tells the program where to find our input data.
this tells the program to remove reads with ambiguous bases or ’N’s
this tells the program to remove reads with low-quality bases
this tells the computer to multithread the program with 2 cores
this tells the program that both the input and output files are compressed with a program call gzip
this tells the program where to put its output

5.6.1 Look at your assigned dataset: See one, do one

Save space in your document using buttons or tabs for sub chapters. Add this code at the end of your title:

See One (Example)
Sample: `mini_SL342389`

Run AdapterRemoval on your mini sample (executable version):

$ AdapterRemoval --file1 mini_SL342389.R1.fastq.gz --file2 mini_SL342389.R2.fastq.gz --trimns --trimqualities --minquality 40 --threads 2 --gzip --output1 clean.mini_SL342389.R1.fastq.gz  --output2 clean.mini_SL342389.R2.fastq.gz

Execute with your sample
Sample: `yourSample`

Run AdapterRemoval on your mini sample:

$ AdapterRemoval --file1 mini_my_reads.R1.fastq.gz --file2 mini_my_reads.R2.fastq.gz --trimns --trimqualities --minquality 40 --threads 2 --gzip --output1 clean.mini_my_reads.R1.fastq.gz  --output2 clean.mini_my_reads.R2.fastq.gz

5.6.2 Now we have read files that are clean. Look at one of the output files by running:

Save space in your document using buttons or tabs for sub chapters. Add this code at the end of your title:

See One (Example)
Sample: `mini_SL342389`

Look at your clean… results

$ zcat clean.mini_SL342389.R1.fastq.gz | head

$ zcat clean.mini_SL342389.R2.fastq.gz | head

For comparison, look again at the first 10 lines of your raw reads:

$ zcat mini_SL342389.R1.fastq.gz | head

$ zcat mini_SL342389.R2.fastq.gz | head

Notice any differences between the Clean and Raw reads? If so, write them down.

Execute with your sample
Sample: `yourSample`

Look at your clean… results

$ zcat clean.mini_my_reads.R1.fastq.gz | head

$ zcat clean.mini_my_reads.R2.fastq.gz | head

For comparison, look again at the first 10 lines of your raw reads:

$ zcat mini_my_reads.R1.fastq.gz | head

$ zcat mini_my_reads.R2.fastq.gz | head

5.6.3

Notice any differences between the Clean and Raw reads? If so, write them down.

5.6.4 Now we have read files that are clean. Look at one of the output files by running:

Save space in your document using buttons or tabs for sub chapters. Add this code at the end of your title:

See One (Example)
Sample: `mini_SL342389`

Count the number of reads in each of your cleaned fastq files by running:

$ zcat clean.mini_SL342389.R1.fastq.gz | wc -l and manually dividing by four.

$ zcat clean.mini_SL342389.R2.fastq.gz | wc -l and manually dividing by four.

Count the number of reads in each of your raw (i.e. initial) fastq files by running:

$ zcat mini_SL342389.R1.fastq.gz | wc -l and manually dividing by four.

$ zcat mini_SL342389.R2.fastq.gz | wc -l and manually dividing by four.

Execute with your sample
Sample: `yourSample`

Count the number of reads in each of your cleaned fastq files by running:

$ zcat clean.mini_reads.R1.fastq.gz | wc -l and manually dividing by four.

$ zcat clean.mini_reads.R2.fastq.gz | wc -l and manually dividing by four.

Count the number of reads in each of your raw (i.e. initial) fastq files by running:

$ zcat mini_reads.R1.fastq.gz | wc -l and manually dividing by four.

$ zcat mini_reads.R2.fastq.gz | wc -l and manually dividing by four.

Record your Raw and Clean read counts to report them in CP4.

`Checkpoint (CP) 4`

In the Google Sheet, please: 1. Upload a screenshot of your zcat results. 2. Answer these questions:

1. Do you notice any differences in the read lengths between your Raw and Cleaned reads? If so, what?
1. Report the numbers of your Raw and Cleaned reads.

5.7 Not shown: Remove contaminating human DNA from your sample

Researchers often decide at this point to remove DNA from unwanted species, such as human. DNA that we don’t want is called “Contaminating DNA”. We are not going to perform this step due to time constraints.

5.8 Execute Metagenomic Analysis

Current step Linux Steps R Steps

Launch Kraken (executable version):

(base) user#@ip-#:~/reads $

$ conda install --channel bioconda --yes kraken

conda is a package manager
install is a subcommand of conda
--channel bioconda
--yes gives conda permission to install software
kraken is the program we are installing

You should now be able to test Kraken by running:

(base) user#@ip-#:~/reads $

kraken --help

--help is an option that displays the options for the tool

If a list of kraken options and explanations is displayed, it’s confirmation the tool is running. Provided this happnes, we can now perform taxonomic classification on our data using this command:

(base) user#@ip-#:~/reads $

kraken --gzip-compressed --fastq-input --threads 2 --paired --preload --db ~/databases/minikraken_20171101_8GB_dustmasked clean.mini_my_reads.R1.fastq.gz clean.mini_my_reads.R2.fastq.gz > mini_my_reads_taxonomic_report.csv

kraken command invocation --gzip-compressed indicates our reads are gzip compressed --fastq-input indicates our reads are formatted as fastq files --threads tells the compute instance how many threads to use --paired indicates we’re using paired end data --preload specifies to load the database into RAM before running --db specifies the path to the database clean.mini_my_reads.R1.fastq.gz filepath to Read1 file clean.mini_my_reads.R2.fastq.gz filepath to Read2 file mini_my_reads_taxonomic_report.csv filepath to file where kraken will place results

5.8.1 Execute metagonomics analysis: See one, do one

See One (Example)
Sample: `mini_SL342389`

(base) user#@ip-#:~/reads $

kraken --gzip-compressed --fastq-input --threads 2 --paired --preload --db ~/databases/minikraken_20171101_8GB_dustmasked clean.mini_SL342389.R1.fastq.gz clean.mini_SL342389.R2.fastq.gz > mini_SL342389_taxonomic_report.csv

Execute with your sample
Sample: `yourSample`

(base) user#@ip-#:~/reads $

kraken --gzip-compressed --fastq-input --threads 2 --paired --preload --db ~/databases/minikraken_20171101_8GB_dustmasked clean.mini_my_reads.R1.fastq.gz clean.mini_my_reads.R2.fastq.gz > mini_my_reads_taxonomic_report.csv

BE SURE to replace mysampleID with your assigned sample ID.

Once this command runs, take a look at the output by running:

$ head mini_my_reads_taxonomic_report.csv (the head command prints the first 10 lines of any file to the terminal)

It should look similar to this:

Note Column 1 indicates whether it was classified by Kraken or not. Column 2 denotes the Read ID (line 1 of the fastq read record). Subsequent columns indicated taxonomic mappings, if any.

The mini_my_reads_taxonomic_report.csv is formatted for storage efficiency and is not very useful for biological interpretation. Convert this file to an .mpa format so that it can be interpreted more easily:

$ kraken-mpa-report --db ~/databases/minikraken_20171101_8GB_dustmasked mini_my_reads_taxonomic_report.csv > mini_my_reads_final_taxonomic_results.mpa

Look at the first 20 lines by executing:

$ head -20 mini_my_reads_final_taxonomic_results.mpa

Look at the last 20 lines by executing:

$ tail -20 mini_my_reads_final_taxonomic_results.mpa

Pull out the species level assignments in your .mpa file and store them in a .tsv file:

$ grep 's__' mini_my_reads_final_taxonomic_results.mpa > mini_my_reads_species_only.tsv

Look at the first 10 lines:

$ head mini_my_reads_species_only.tsv

See how many species were identified:

$ wc -l mini_my_reads_species_only.tsv

`Checkpoint (CP) 5`

In the Google Sheet, please:

Upload a screenshot of your kraken command output. (i.e. the one that states the % sequences classified and unclassified)
Upload a screenshot of your head -20 mini_my_reads_final_taxonomic_results.mpa command.
Upload a screenshot of your tail -20 mini_my_reads_final_taxonomic_results.mpa command.
Answer these questions:

1. What do d, p, and other abbreviations stand for?
1. What do you the numbers at the end of each line signify?
1. Why do you think Kraken can’t characterize some reads at all?
1. Why do you think Kraken can’t characterize some reads down to the species level?
1. When identifying only the species-level taxonomic assignments, what “search term” do you use in your code?

5.9 Generate barplot using all kraken samples that you and your classmates analyzed

Now create a barplot with all of the metasub samples:

library(ggplot2)  

taxa_all_barplot = read.csv('kraken_all_metasub.csv', header=TRUE, sep=',') 

taxa_all_barplot = taxa_all_barplot[taxa_all_barplot$rank == 'phylum',] 

taxa_all_barplot = taxa_all_barplot[taxa_all_barplot$percent_abundance >= 2,] 

ggplot(taxa_all_barplot, aes(x=sample, y=percent_abundance, fill=taxon)) +
  geom_bar(stat="identity") +  
  xlab('Sample') + 
  ylab('Abundance') +
  labs(fill='Phylum') +
  theme(axis.text.x = element_text(angle = 90))

`Checkpoint (CP) 6`

Please upload the following to the Google Sheet:

Barplot at the phylum level for all samples. (To download this plot as a file, click the (+) Zoom button, right click on the plot and select “Save Image As”.)

5.10 Sort species abundance from highest to lowest

Read-in all of the Kraken data from the samples you and the other participants analyzed:

taxa_all = read.csv('kraken_all_metasub.csv', header=TRUE, sep=',')

taxa_all_spp = taxa_all[taxa_all$rank == 'species',] 

taxa_all_spp_mySampleID = taxa_all_spp[taxa_all_spp$sample == 'SampleID',]  

taxa_all_spp_mySampleID_ordered <- taxa_all_spp_mySampleID[order(taxa_all_spp_mySampleID$percent_abundance, decreasing = TRUE), ]  

head(taxa_all_spp_mySampleID_ordered, 5)

Read-in all metasub data
Filter down to species level
Sort to your sample. Be sure to specify your SampleID (e.g. SL342389) where indicated.
Order object from highest-to-lowest percent abundance
Print top 5 most abundant species in your sample to the Console

For your information, you can open taxa_all_spp_mySampleID as a table in the Scripting Pane by clicking on the name of the object in the Environment Pane. You can order rows based on the incresing and decreasing values of a specific column by clicking the ▲ and ▼ arrows of that column.

`Checkpoint (CP) 7`

In the Google Sheet:

List the top 5 most abundant species in your sample.

5.11 ID Potential Antibiotic Resistance Loci in the SPP in Your Assigned Sample

Navigate to the Bacterial and Viral Bioinformatics Resource Center (BV-BRC): https://www.bv-brc.org/

Use the tools in BV-BRC to ID at least one AMR locus/gene for the top 5 most abundant species in your sample.

Select Taxa option from the “All Data Types” dropdown menu
Search for a given species via the search bar
Select the “species” result (usually the first one) by checking the box next to that row
Click TAXON OVERVIEW from the green column
Click the Specialty Genes tab
Click the blue Filter button
For Property, select Antibiotic Resitance
For Source, select PATRIC
For Classification, search for efflux pump conferring antibiotic resistance using the magnifying glass feature
Select a few of the results rows by checking their boxes
Click the FASTA option in the green column and select View FASTA DNA This will show you the nucleotide sequences of the selected resistance loci.

Here’s a screencast using the species Staphylococcus aureus (Methicillin-resistant Staphylococcus aureus (MRSA) is a widely known antibiotic resistant hospital acquired infection):

`Checkpoint (CP) 8`

In the Google Sheet:

List three efflux pump conferring antibiotic resistance locus IDs for the species you queried using the BV-BRC.
Hypothesize how you would figure out if a resistant strain of one of the abundant species in your sample was present.

6 Biology + Data Science in the Classroom

While the linux steps can be challenging to perform on a local computuer, students should be able to perform all R/RStudio related steps.

R is free and open source. The Desktop version of RStudio is free.

To use RStudio, you must:

Download and install Base R.
Download and install RStudio

Please feel free to Download the taxa_table.csv and kraken_all_metasub.csv for your own use.

7 Data Science Principles Covered

Using commands (e.g. head, tail) to understand the structure/format of the data.
Use operators (e.g. $ or @) to filter or sort data by specific variables (i.e. columns)
Use commands like , summary(), density() and plot to understand the descriptive statistics and general distribution of a dataset.
Dataset sizes (footprints) tend to get smaller as one works through analyses downstream.
To compare samples visually, one often tries to reduce the dimensionality of the data.
If you’re having trouble running a command, try to first reproduce the sample code result.

8 Post-VTP Survey

We would greatly appreciate it if you could complete the POST-VTP Survey: https://forms.gle/ogTzjPW61S4WfVou6

8.1 one

8.2 two

9 Use buttons or tabs for sub-chapters

#{.tabset .tabset-fade .tabset-pills}

Save space in your document using buttons or tabs for sub chapters. Add this code at the end of your title:

See One (Example)
sample

$ AdapterRemoval --file1 mini_my_reads.R1.fastq.gz --file2 mini_my_reads.R2.fastq.gz --trimns --trimqualities --minquality 40 --threads 2 --gzip --output1 clean.mini_my_reads.R1.fastq.gz  --output2 clean.mini_my_reads.R2.fastq.gz

Do one (repeat example)
sample

content of sub-chapter #2

Run AdapterRemoval on your mini sample

$ AdapterRemoval --file1 mini_my_reads.R1.fastq.gz --file2 mini_my_reads.R2.fastq.gz --trimns --trimqualities --minquality 40 --threads 2 --gzip --output1 clean.mini_my_reads.R1.fastq.gz  --output2 clean.mini_my_reads.R2.fastq.gz

Please note: mini_my_reads is a placeholder for your assigned sample ID. The file extension .fq is a widely used abbreviation for .fastq.

Redo One (your sample)

content of sub-chapter #3

Run AdapterRemoval on your mini sample (executable version):

$ AdapterRemoval --file1 mini_my_reads.R1.fastq.gz --file2 mini_my_reads.R2.fastq.gz --trimns --trimqualities --minquality 40 --threads 2 --gzip --output1 clean.mini_my_reads.R1.fastq.gz  --output2 clean.mini_my_reads.R2.fastq.gz

Please note: mini_my_reads is a placeholder for your assigned sample ID. The file extension .fq is a widely used abbreviation for .fastq.

When dollars appear it’s a sign
  that your code does not quite align
Ensure that your math
  in xaringan hath
  been placed on a single long line

10 Mission

And at the end,

10.1 Quarter

Quarterly reports

Metagenomics + Microbial Surveillance Virtual Training Project

MILRD

1 Getting Started

1.1 RStudio, Assigned Dataset, Checkpoint Information

1.2 Slack

1.3 Troubleshooting

2 Introduction to the project.

3 Rationale for the MetaSub Project

4 Overview: bioinformatics analyses

5 Bioinformatics Tasks

5.1 Login to your RStudio Instance

5.2 Generate barplot(s) using subset of metagenomic samples

5.2.1 Generate Barplot: See one, do one

See One (Example) Sample: mini_SL342389

Execute with your sample Sample: yourSample

Execute chunk-by-chunk Sample: yourSample

Create: family-level barplot Sample: yourSample

Checkpoint (CP) 1

5.3 Access your Linux terminal via RStudio

5.4 Navigating your Linux terminal.

Checkpoint (CP) 2

5.5 Look at your assigned dataset (time estimate: 10 min)

5.5.1 Look at your assigned dataset: See one, do one

See One (Example) Sample: mini_SL342389

Execute with your sample Sample: yourSample

Checkpoint (CP) 3

5.6 Remove adapters and filter-out low quality reads

5.6.1 Look at your assigned dataset: See one, do one

See One (Example) Sample: mini_SL342389

Execute with your sample Sample: yourSample

5.6.2 Now we have read files that are clean. Look at one of the output files by running:

See One (Example) Sample: mini_SL342389

Execute with your sample Sample: yourSample

5.6.3

5.6.4 Now we have read files that are clean. Look at one of the output files by running:

See One (Example) Sample: mini_SL342389

Execute with your sample Sample: yourSample

Checkpoint (CP) 4

5.7 Not shown: Remove contaminating human DNA from your sample

5.8 Execute Metagenomic Analysis

5.8.1 Execute metagonomics analysis: See one, do one

See One (Example) Sample: mini_SL342389

Execute with your sample Sample: yourSample

Checkpoint (CP) 5

5.9 Generate barplot using all kraken samples that you and your classmates analyzed

Checkpoint (CP) 6

5.10 Sort species abundance from highest to lowest

Checkpoint (CP) 7

5.11 ID Potential Antibiotic Resistance Loci in the SPP in Your Assigned Sample

Checkpoint (CP) 8

6 Biology + Data Science in the Classroom

7 Data Science Principles Covered

8 Post-VTP Survey

8.1 one

8.2 two

9 Use buttons or tabs for sub-chapters

See One (Example) sample

Do one (repeat example) sample

Redo One (your sample)

10 Mission

10.1 Quarter

See One (Example)
Sample: `mini_SL342389`

Execute with your sample
Sample: `yourSample`

Execute chunk-by-chunk
Sample: `yourSample`

Create: family-level barplot
Sample: `yourSample`

`Checkpoint (CP) 1`

`Checkpoint (CP) 2`

See One (Example)
Sample: `mini_SL342389`

Execute with your sample
Sample: `yourSample`

`Checkpoint (CP) 3`

See One (Example)
Sample: `mini_SL342389`

Execute with your sample
Sample: `yourSample`

See One (Example)
Sample: `mini_SL342389`

Execute with your sample
Sample: `yourSample`

See One (Example)
Sample: `mini_SL342389`

Execute with your sample
Sample: `yourSample`

`Checkpoint (CP) 4`

See One (Example)
Sample: `mini_SL342389`

Execute with your sample
Sample: `yourSample`

`Checkpoint (CP) 5`

`Checkpoint (CP) 6`

`Checkpoint (CP) 7`

`Checkpoint (CP) 8`

See One (Example)
sample

Do one (repeat example)
sample