Please complete the Pre-VTP Survey.
The primary goal of this Virtual Training Project (VTP) is to use the Linux command line and the R programming language to bioinformatically characterize, quantify, and visualize the species composition of urban microbiome samples (i.e. subway swabs), from raw data to completed analysis.
What is bioinformatics?
Why is the metagenomics application of bioinformatics useful?
In case you are thinking, "How would someone be able to use this knowledge in real life?", here are three real-world applications of the metagenomics analysis technique you will learn:
Research Applications: You can ask questions about the similarity of microbiome samples, the prevalence/emergence of antimicrobial-resistant strains, etc. The MetaSUB Project, which is where the data you'll analyze in this VTP comes from, is an example of this application.
Clinical Applications: This technique is already being employed by some biotech startups, like Karius, to rapidly diagnose blood-borne infections.
Biotech/Industry Applications: Some companies offer microbial surveillance services to identify and monitor the presence of resistant pathogens (e.g. on hospital surfaces). One such startup, Biotia, was spun out of the Mason lab (the same Cornell Med lab that created the MetaSUB project) and utilizes many of the same techniques that you will employ in this VTP.
In a broader sense, bioinformatics is a subdivision of data science; principles learned in the former are highly relevant to the latter. We'll point these out as we go along.
The source of the data you will analyze in this VTP is from the MetaSUB Project, an effort to characterize the "built environment" microbiomes of mass transit systems around the world, headed by Dr. Chris Mason's Lab at Weill Cornell Medical Center.
MetaSUB is an international consortium of laboratories from all over the world working together to study the microbes of mass transit systems (bacteria, viruses, and archaea, i.e. their microbiomes) and to establish a worldwide "DNA map" of sequences from them. The project was started by the Mason lab and is an extension of a previous study called Pathomap. Pathomap looked only at the NYC subway system and aimed to create a profile of what's normal, identify potential bio-threats, and use the data to make the city better for people's health. MetaSUB recognized that other cities around the world could benefit from a similar study and expanded the project to include subway systems worldwide.
Take a look at this figure, which provides an overview of MetaSUB's design and execution:
As you can see, the researchers:
(A) sampled New York City's five boroughs,
(B) collected samples from the 466 subway stations of NYC across the 24 subway lines,
(C) extracted, sequenced, QC'd, and analyzed the DNA,
(D) mapped the distribution of taxa identified from the entire pooled dataset, and
(E) presented a geospatial analysis of a highly prominent genus, Pseudomonas.
Notably, as seen in (D), nearly half of the DNA sequenced (48%) did not match any known organism, underscoring the vast wealth of likely unknown organisms that surround passengers every day.
Now, let's focus on Fig. 1C, which shows, at a high level, the steps to process and analyze samples. We'll perform a very similar (though not identical) analysis in this VTP.
Here's a simplified version of sample processing and sequencing:
Alright, before moving onto the next section, here are some more resources in case you want to review them later:
Here's the MetaSUB paper, in case you would like to review it. To be frank, this article is a bit challenging to read, so we suggest coming back to it after you've done a bit of the VTP, if you're inclined.
New York Times article about the MetaSUB Project.
Here's the Pathomap paper, the predecessor study that only investigated the NYC Subway Microbiome.
Throughout this VTP, you will characterize, quantify and visualize microbial metagenomics data from sequenced swabs of public urban environments on your own cloud-hosted High Performance Compute Instance.
In the Linux terminal, you will learn the structure of raw genomic data and perform taxonomic characterization & abundance quantification. In R, you will determine the most abundant species in your sample and visualize results at several taxonomic levels.
Here's a flowchart of the analysis you'll perform:
All bioinformatics tasks will be performed in "the cloud" on your own Amazon Web Services (AWS) hosted high performance compute instance.
What is cloud computing?
For this project, you will execute all Linux and R tasks in RStudio.
RStudio is an integrated development environment (IDE) for the open-source R programming language, which basically means it brings the core tools for working with R (Scripting Pane, Console, Environment Pane, File Manager/Plots/Help Pane) into a quadrant-based user interface for efficient use.
RStudio also comes with a built-in Linux terminal, where you can execute Linux commands.
Access your Linux terminal by logging into RStudio via the web browser (the login details can be found in the Getting Started section).
Here is what the RStudio dashboard looks like:
See One, Do One, Repeat: In each step of this analysis, you will first see the analysis performed with the Example Sample (mini_SL342389). You will be asked to reproduce the step with that same sample and confirm you get the same results. You will then be asked to execute the step with your own assigned sample, which is found in the Getting Started section.
Do not get hung up on coding syntax: our goal is not to teach you a programming language from scratch. Our goal is for you to understand the inputs, parameters, and outputs, and how to interpret the output. Some participants find it helpful to think of each step as a mathematical function (e.g. y = mx + b): regardless of how complicated a block of code or command is, you are always putting data in and getting data out.
It should look like this once you're finished setting up your Linux environment:
Linux is a command-line based operating system. For our purposes, Linux = Unix.
The first few steps of this project are performed in Linux. Linux is a powerful operating system used by many scientists and engineers to analyze data. Linux commands are written on the command line, which means that you tell the computer what you want it to do by typing, not by pointing and clicking as you do in operating systems like Windows, ChromeOS, iOS, or macOS. The commands themselves are written in a language called Bash.
Here's a brief tutorial that covers some concepts and common commands using sample code.
Concepts
Command Line
The Linux command line is a way of interacting with the computer's operating system by typing in commands in a text-based interface. It's a bit like talking to your computer using a special language, and that language is called Bash.
The command line in the Terminal tab should look something like this:
(base) your-user#@your-AWS-IP-address:~$
(base) your-user#@your-AWS-IP-address:~$ linux-commands-are-typed here
(base) your-user#@your-AWS-IP-address:~$
Using the command line is different than using a graphical user interface (GUI) because instead of clicking on icons and buttons to interact with your computer, you type in text commands. It can take some time to learn the commands and syntax of the command line, but once you do, you can often do things faster and more efficiently than with a GUI. Some people also prefer using the command line because it gives them more control over their computer's settings and can be more powerful for certain tasks.
Let's type a command in the command line.
List the files and directories in the current directory using the ls
command:
ls
You should see something like this output to the screen:
R mini_SL335868_taxa.csv mini_SL336556_taxa.csv mini_SL342390_taxa.csv miniconda3
databases mini_SL335869_taxa.csv mini_SL336557_taxa.csv mini_SL345999_taxa.csv pass.txt
kraken_all_metasub.csv mini_SL335870_taxa.csv mini_SL336558_taxa.csv mini_SL346000_taxa.csv reads
mini_SL254773_taxa.csv mini_SL335871_taxa.csv mini_SL336559_taxa.csv mini_SL346001_taxa.csv rstudio
mini_SL254775_taxa.csv mini_SL335872_taxa.csv mini_SL336560_taxa.csv mini_SL346002_taxa.csv taxa_table.csv
mini_SL254810_taxa.csv mini_SL335873_taxa.csv mini_SL336561_taxa.csv mini_SL346003_taxa.csv
mini_SL254820_taxa.csv mini_SL335875_taxa.csv mini_SL336562_taxa.csv mini_SL346004_taxa.csv
mini_SL324391_taxa.csv mini_SL336555_taxa.csv mini_SL342389_taxa.csv mini_SL346005_taxa.csv
It looks like a lot, but don't fret: this is just a bunch of files and directories (i.e. folders). You'll notice the Terminal color-codes entries by their type; for example, directories are colored blue. A directory is computer-science speak for a folder, so this listing shows you what's inside the folder you are currently sitting in.
Directory
As we have mentioned, a directory is another name for a folder in which files are stored.
If you look immediately to the left of the $, you will see what is called the "working directory". The ~ symbol has a special meaning in Linux systems: it refers to the home directory on your system.
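As an optional aside, you can print the full path of your working directory (the path that ~ abbreviates) with the pwd ("print working directory") command:

pwd

From the home directory, this should print something like /home/your-user#, depending on your username.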
Navigate to the reads directory using the cd command.
cd ~/reads/
After you execute the command, your command line should look like this:
(base) user2@ip-172-31-25-174:~/reads$
Now, use the ls command to list the files and directories in this directory:
ls
You should see many files, most of which end in the extension fastq.gz:
SL254773.R1.fastq.gz SL335875.R1.fastq.gz SL342390.R1.fastq.gz mini_SL254775.R1.fastq.gz mini_SL336555.R1.fastq.gz mini_SL342391.R1.fastq.gz
SL254775.R1.fastq.gz SL336555.R1.fastq.gz SL342391.R1.fastq.gz mini_SL254810.R1.fastq.gz mini_SL336556.R1.fastq.gz mini_SL345999.R1.fastq.gz
SL254810.R1.fastq.gz SL336556.R1.fastq.gz SL345999.R1.fastq.gz mini_SL254820.R1.fastq.gz mini_SL336557.R1.fastq.gz mini_SL346000.R1.fastq.gz
SL254820.R1.fastq.gz SL336557.R1.fastq.gz SL346000.R1.fastq.gz mini_SL335868.R1.fastq.gz mini_SL336558.R1.fastq.gz mini_SL346001.R1.fastq.gz
SL335868.R1.fastq.gz SL336558.R1.fastq.gz SL346001.R1.fastq.gz mini_SL335869.R1.fastq.gz mini_SL336559.R1.fastq.gz mini_SL346002.R1.fastq.gz
SL335869.R1.fastq.gz SL336559.R1.fastq.gz SL346002.R1.fastq.gz mini_SL335870.R1.fastq.gz mini_SL336560.R1.fastq.gz mini_SL346003.R1.fastq.gz
SL335870.R1.fastq.gz SL336560.R1.fastq.gz SL346003.R1.fastq.gz mini_SL335871.R1.fastq.gz mini_SL336561.R1.fastq.gz mini_SL346004.R1.fastq.gz
SL335871.R1.fastq.gz SL336561.R1.fastq.gz SL346004.R1.fastq.gz mini_SL335872.R1.fastq.gz mini_SL336562.R1.fastq.gz mini_SL346005.R1.fastq.gz
SL335872.R1.fastq.gz SL336562.R1.fastq.gz SL346005.R1.fastq.gz mini_SL335873.R1.fastq.gz mini_SL342389.R1.fastq.gz
SL335873.R1.fastq.gz SL342389.R1.fastq.gz mini_SL254773.R1.fastq.gz mini_SL335875.R1.fastq.gz mini_SL342390.R1.fastq.gz
...
Now, let's say you need to go back to the home (~) directory:
cd ~
You should once again be back in the home (~) directory. It should look like this:
(base) your-user#@your-AWS-IP-address:~$
Now, go back to the reads directory.
cd ~/reads/
Sometimes we need to create a new directory to store files for a project, just like when you create a new folder on your computer.
Let's create a new directory called test_directory in the current directory using the mkdir command.
mkdir test_directory
Execute ls to confirm that the test_directory directory was indeed created.
Enter the test_directory directory.
cd test_directory
Your terminal should look like this:
(base) user2@ip-172-31-25-174:~/reads/test_directory$
Return to the ~/reads directory:
cd ~/reads
List the last 15 commands run in your terminal so you can submit them in the next checkpoint:
history 15
There's a lot more that we could cover about working with the Linux command line, but this is enough to get started with your bioinformatics analysis.
Please paste your history 15 output in the Google Sheet.
In this training project, you will execute each step with an Example Sample, mini_SL342389, and then with your own assigned sample (found in the Google Sheet).
To ensure each step doesn't take too long, we've taken 50,000 reads from each assigned sample. These subsampled files are labeled with the prefix mini_ and located in the reads directory (files without the mini_ prefix are the original, full-size files, which each contain millions of reads).
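If you're curious, you can compare a full-size file with its mini_ counterpart using the ls command's -l (long listing) and -h (human-readable file sizes) flags. This is an optional aside, and the exact sizes will vary:

ls -lh ~/reads/SL342389.R1.fastq.gz ~/reads/mini_SL342389.R1.fastq.gz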
Here is a diagram that illustrates how genomic sequencing data (i.e. reads) is generated:
The raw genomic data for this project is in fastq formatted files. Fastq files have the extension .fastq.
A fastq file is a collection of fastq records.
Here's an example of a Fastq record:
@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
A FASTQ record has the following four-line format:
Line 1: a sequence identifier, beginning with the @ symbol
Line 2: the nucleotide sequence
Line 3: a separator line, beginning with the + symbol
Line 4: the quality scores, one character per nucleotide in line 2
Here's an example of a FASTQ file with two records:
@071112_SLXA-EAS1_s_7:5:1:817:345
GGGTGATGGCCGCTGCCGATGGCGTCAAATCCCACC
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IG9IC
@071112_SLXA-EAS1_s_7:5:1:801:338
GTTCAGGGATACGACGTTTGTATTTTAAGAATCTGA
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII6IBI
First, we have a universal step that isn't specific to any particular sample. We change to the "reads" directory by typing:
cd ~/reads/
These instructions are written assuming all Linux steps take place from the reads directory.
mini_SL342389
We're going to use the head and zcat commands to look at the first few lines of our reads files.
To see the first few lines of the read 1 file, we use the zcat and head commands:
zcat mini_SL342389.R1.fastq.gz | head
$ zcat mini_SL342389.R1.fastq.gz | head
@E00438:418:HMGW3CCXY:2:1101:20547:1467 1:N:0:NAGTACGC
GGTTGCGGAGCCGGGCGAGGTAGCGGGCGCGGATCGCGGGGGCGGGGGGCAGGTCGCCCGTGGTTTCGCCGAGCCAGGGGTTGGCGAAGGTTGCGCCTTGCAGGCCGGCTTGCATCCCAAAAAGGGCGAGCG
+
AAAFJFJJFJJ7FJJFJJJJJJJJJJJJFFJJJJAJJJJJJJAFJJJJJJFFJ<AJAJJJJJJJJJJJJJJFJ7A<JJJJJJJJFJJJJJ7JJFJF<AJJAFJAAJFJFJJJ<-AFJFFJJJFFFF7FFJFF
@E00438:418:HMGW3CCXY:2:1101:14387:1309 1:N:0:NAGTACGC
CGCGCGTGTACACGCTGAAGTCCTGCCAGCGGTGGCTGCGGCTGAAGCTGGCCTGCCAGGTCCGGCTGAGCACGGGCTCGCGGCGCGGCTCCAGCGGACCGGGCGGCAGCAAGGGATCCAGCCCGAACAACGGGTCCTGCTCGAATTCAG
+
AAAFJJJJJJJJJJ<JJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJJJJJJ<FJJJFJJJJFJJJJJJJJFJJJJFFJJJJJJJJJJJJJJFJJJJFJJJJJ<JJJJJJJJJJJ<J<JFJJFJJFFFAJAFF<<FJJFAJJJ<AJF7FA
@E00438:418:HMGW3CCXY:2:1101:13595:1766 1:N:0:NAGTACGC
GTCATTGTCAAACCCTCGGAAGAAACACCACTCACAACGACCTTGTTAGGTGAAGTTATGAATGAGGCAGG
The "zcat" command is like "cat", which shows the contents of a file, but it can read compressed files (the ".gz" extension means the file is compressed). The vertical bar "|" is a "pipe" that sends the output of the "zcat" command to the "head" command, which shows the first few lines of that output. If the file weren't compressed, you could simply type head name-of-file and it would print the first 10 lines to the screen. You can still execute head name-of-compressed-file on a compressed file, but it will output the first 10 lines of the raw compressed data, which will look like a bunch of gobbledy-gook (i.e. not human-interpretable).
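Since each fastq record is exactly four lines, you can view just the first record by telling head exactly how many lines to print:

zcat mini_SL342389.R1.fastq.gz | head -4

This should print only the first record, i.e. the first four lines of the output shown above.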
Now that you've looked at the example, let's view the first few lines of the read file for your own sample (which we're calling mini_your_reads).
Please note: mini_your_reads is a placeholder. You need to replace this text with the name of your assigned sample.
Look at the first few lines of your sample's read 1 file:
zcat mini_your_reads.R1.fastq.gz | head
Takeaways:
Bioinformaticians, data scientists, computer scientists, and folks in related professions will often use commands like head and zcat to look at a few lines of a file to understand its structure, or to check that it looks the way it's supposed to.
You might ask: why not open the whole file? Two reasons: (1) files in data science are often too large to open (they can be 10s, 100s, or even 1000s of gigabytes); (2) even if it's possible to open a file, it's often easier to look at the first few lines if your goal is to check or understand the format (see the example after this list).
It's important to understand the structure/format of the data you are analyzing.
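For example, you can verify that a mini_ file really contains 50,000 reads without ever opening it. Recall that each fastq record occupies exactly four lines, so the total line count divided by four gives the read count. The wc -l command counts the lines in its input:

zcat mini_SL342389.R1.fastq.gz | wc -l

This should print 200000 (50,000 reads x 4 lines per record).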
In the Google Sheet, please:
Upload screenshot of your zcat results from the Example Sample
Upload screenshot of your zcat results from Your Sample
How are fastq files structured? (i.e. What do line 1, line 2, line 3, and line 4 of each record signify?)
In this step, we're going to use a tool called Kraken to identify and quantify the microorganisms (bacteria, archaea, and viruses in our case) based on the sequences of the DNA fragments in our sample. Kraken is a software tool used in metagenomics to identify and classify the microorganisms present in a mixed DNA sample.
It is like a detective that helps scientists figure out which living things are in a sample, such as a swab from a subway, dirt, water, or even poop. It looks at tiny bits of DNA from the sample and compares them to a big library of DNA from lots of different living things, like bacteria, viruses, and fungi. Kraken then tells scientists which living things are most likely to be in the sample based on the similarities between the DNA in the sample and the DNA in the library.
Kraken uses a method called "k-mer" analysis to identify and quantify species.
To explain this in simple terms, imagine you have a long sentence made up of lots of words. A k-mer analysis would look at all the short "words" or "letter sequences" of a certain length (let's say 5 letters) in that sentence. Each unique sequence of 5 letters is called a "k-mer".
In the case of Kraken, it looks at the DNA sequences of all the organisms in a sample (which might be from a soil sample, for example) and it breaks them down into lots of short k-mers. It then compares these k-mers to a database of known k-mers from lots of different organisms.
If a particular k-mer from the sample matches a k-mer from a known organism in the database, then Kraken can say that it's likely that the organism is present in the sample. By looking at all the k-mers in the sample and comparing them to the database, Kraken can identify which organisms are likely to be present in the sample and how abundant they are.
So essentially, Kraken is like a detective that looks for clues (in the form of k-mers) to figure out which organisms are present in a sample and how many of them there are.
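To make the idea concrete, here's a toy Bash sketch (not part of the analysis pipeline) that breaks a short DNA sequence into all of its overlapping 5-mers:

# Toy example: list every overlapping 5-mer in a short sequence
seq="GATTTGGGG"
k=5
for ((i=0; i<=${#seq}-k; i++)); do
  echo "${seq:i:k}"
done
# Prints: GATTT ATTTG TTTGG TTGGG TGGGG

Kraken itself uses much longer k-mers (31 bases by default), which makes matches far more specific to particular organisms.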
This is a universal step; you only need to run it once to install the software.
Install the kraken software.
conda install --channel bioconda --yes kraken
Confirm kraken was successfully installed.
kraken --help
This command displays the help menu for the Kraken software tool, which explains how to use the tool and what options it offers. Don't worry about reading or understanding all of it; we're just using this command to confirm the software was installed. If it weren't, you'd see an error message instead of information about how to use the tool.
mini_SL342389
kraken --gzip-compressed --fastq-input --threads 2 --preload --db ~/databases/minikraken_20171101_8GB_dustmasked mini_SL342389.R1.fastq.gz > mini_SL342389_taxonomic_report.csv
The output should look like this:
Loading database... complete.
50000 sequences (6.61 Mbp) processed in 44.426s (67.5 Kseq/m, 8.93 Mbp/m).
7537 sequences classified (15.07%)
42463 sequences unclassified (84.93%)
This command runs the Kraken tool on the sequencing data for Example Sample mini_SL342389.
--gzip-compressed specifies that the input files are compressed in the gzip format.
--fastq-input specifies that the input files are in the FASTQ format.
--threads 2 specifies that the tool should use 2 threads to speed up the analysis.
--preload loads the database into memory to speed up the analysis.
--db ~/databases/minikraken_20171101_8GB_dustmasked specifies the Kraken database to be used for the analysis.
mini_SL342389.R1.fastq.gz is the input file containing the reads.
> mini_SL342389_taxonomic_report.csv redirects the output of the tool to a CSV file named mini_SL342389_taxonomic_report.csv.

Look at the output of the kraken command.
head mini_SL342389_taxonomic_report.csv
The output to your terminal should look like this:
U E00438:418:HMGW3CCXY:2:1101:20547:1467 0 132 0:102
C E00438:418:HMGW3CCXY:2:1101:14387:1309 1163399 150 0:21 1163399:1 0:6 1163399:1 0:26 1163399:1 0:4 1163399:1 0:12 1163399:1 0:46
U E00438:418:HMGW3CCXY:2:1101:13595:1766 0 71 0:41
U E00438:418:HMGW3CCXY:2:1101:15585:1309 0 150 0:120
U E00438:418:HMGW3CCXY:2:1101:22323:1309 0 127 0:97
U E00438:418:HMGW3CCXY:2:1101:16467:1327 0 150 0:120
U E00438:418:HMGW3CCXY:2:1101:16518:1344 0 124 0:94
U E00438:418:HMGW3CCXY:2:1101:16386:1362 0 140 0:110
U E00438:418:HMGW3CCXY:2:1101:19015:1836 0 117 0:87
U E00438:418:HMGW3CCXY:2:1101:19390:1362 0 150 0:120
The output shows, for each sequence, a classification code, an identifier, the length of the sequence, and the path through the taxonomic tree that Kraken used to classify it.
Each row represents a different DNA sequence that was analyzed.
The first column shows the classification of the sequence, which can be one of four things:
U: The sequence was unclassified, meaning it did not match anything in the Kraken database.
C: The sequence was classified, meaning it matched a specific taxonomic group in the Kraken database.
E: The sequence was an error, meaning there was some problem with the analysis.
?: The sequence was ambiguous, meaning it matched multiple taxonomic groups in the Kraken database.
The second column shows an identifier for the sequence, which comes from line 1 of the corresponding fastq record.
The remaining columns give the taxonomy ID that Kraken assigned to the read as a whole (0 if unclassified), the length of the read, and the list of mappings Kraken used to make the call. Each mapping is a pair of numbers separated by a colon: the first number is the ID of a taxonomic group in the database, and the second number is how many of the read's k-mers matched that group. For example, "0:117" means that 117 consecutive k-mers did not match any specific taxonomic group, and "1163399:1" means that 1 k-mer matched the taxonomic group with ID 1163399.
Overall, this output tells us how many sequences were classified, how long they were, and how Kraken classified them based on their similarity to sequences in the Kraken database.
But this isn't very reader-friendly, so we're going to convert it into MetaPhlAn format (denoted by a .mpa file extension), which is more human-interpretable, using a tool called kraken-mpa-report.
kraken-mpa-report --db ~/databases/minikraken_20171101_8GB_dustmasked mini_SL342389_taxonomic_report.csv > mini_SL342389_final_taxonomic_results.mpa
kraken-mpa-report is a command-line tool that generates a report summarizing the results of a Kraken analysis in MetaPhlAn (.mpa) format.
--db ~/databases/minikraken_20171101_8GB_dustmasked specifies the path to the Kraken database to use for the analysis. In this case, it is using a database called minikraken_20171101_8GB_dustmasked.
> redirects the output of the command to a file called mini_SL342389_final_taxonomic_results.mpa. This means that the output will be saved in a file instead of being displayed on the screen.

Print the first 20 lines of the mini_SL342389_final_taxonomic_results.mpa output file to the terminal screen:
head -20 mini_SL342389_final_taxonomic_results.mpa
You should get this:
d__Viruses 20
d__Bacteria 7500
d__Archaea 7
d__Bacteria|p__Proteobacteria 3280
d__Bacteria|p__Fusobacteria 3
d__Bacteria|p__Acidobacteria 2
d__Bacteria|p__Verrucomicrobia 2
d__Bacteria|p__Kiritimatiellaeota 1
d__Bacteria|p__Planctomycetes 6
d__Bacteria|p__Bacteroidetes 126
d__Bacteria|p__Firmicutes 638
d__Bacteria|p__Deinococcus-Thermus 182
d__Bacteria|p__Actinobacteria 3151
d__Bacteria|p__Tenericutes 1
d__Archaea|p__Euryarchaeota 7
d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria 2121
d__Bacteria|p__Proteobacteria|c__Alphaproteobacteria 582
d__Bacteria|p__Proteobacteria|c__Betaproteobacteria 439
d__Bacteria|p__Proteobacteria|c__Deltaproteobacteria 13
d__Bacteria|p__Proteobacteria|c__Epsilonproteobacteria 3
Print the last 20 lines of mini_SL342389_final_taxonomic_results.mpa to the terminal screen.
tail -20 mini_SL342389_final_taxonomic_results.mpa
Confirm you get this output:
d__Bacteria|p__Actinobacteria|c__Actinobacteria|o__Kineosporiales|f__Kineosporiaceae|g__Kineococcus|s__Kineococcus_radiotolerans 3
d__Bacteria|p__Actinobacteria|c__Actinobacteria|o__Jiangellales|f__Jiangellaceae|g__Jiangella|s__Jiangella_alkaliphila 1
d__Bacteria|p__Actinobacteria|c__Actinobacteria|o__Jiangellales|f__Jiangellaceae|g__Jiangella|s__Jiangella_sp_DSM_45060 3
d__Bacteria|p__Actinobacteria|c__Actinobacteria|o__Geodermatophilales|f__Geodermatophilaceae|g__Geodermatophilus|s__Geodermatophilus_obscurus 17
d__Bacteria|p__Actinobacteria|c__Actinobacteria|o__Geodermatophilales|f__Geodermatophilaceae|g__Blastococcus|s__Blastococcus_saxobsidens 22
d__Bacteria|p__Actinobacteria|c__Actinobacteria|o__Geodermatophilales|f__Geodermatophilaceae|g__Modestobacter|s__Modestobacter_marinus 48
d__Bacteria|p__Actinobacteria|c__Actinobacteria|o__Nakamurellales|f__Nakamurellaceae|g__Nakamurella|s__Nakamurella_multipartita 9
d__Bacteria|p__Actinobacteria|c__Actinobacteria|o__Nakamurellales|f__Nakamurellaceae|g__Nakamurella|s__Nakamurella_panacisegetis 3
d__Bacteria|p__Actinobacteria|c__Actinobacteria|g__Thermobispora|s__Thermobispora_bispora 1
d__Bacteria|p__Actinobacteria|c__Acidimicrobiia|o__Acidimicrobiales|f__Acidimicrobiaceae|g__Acidimicrobium|s__Acidimicrobium_ferrooxidans 1
d__Bacteria|p__Actinobacteria|c__Rubrobacteria|o__Rubrobacterales|f__Rubrobacteraceae|g__Rubrobacter|s__Rubrobacter_xylanophilus 1
d__Bacteria|p__Actinobacteria|c__Coriobacteriia|o__Eggerthellales|f__Eggerthellaceae|g__Eggerthella|s__Eggerthella_lenta 1
d__Bacteria|p__Actinobacteria|c__Coriobacteriia|o__Eggerthellales|f__Eggerthellaceae|g__Gordonibacter|s__Gordonibacter_urolithinfaciens 1
d__Bacteria|p__Actinobacteria|c__Thermoleophilia|o__Solirubrobacterales|f__Conexibacteraceae|g__Conexibacter|s__Conexibacter_woesei 2
d__Bacteria|p__Tenericutes|c__Mollicutes|o__Mycoplasmatales|f__Mycoplasmataceae|g__Mycoplasma|s__Mycoplasma_cynos 1
d__Archaea|p__Euryarchaeota|c__Halobacteria|o__Halobacteriales|f__Haloarculaceae|g__Halomicrobium|s__Halomicrobium_mukohataei 1
d__Archaea|p__Euryarchaeota|c__Halobacteria|o__Haloferacales|f__Haloferacaceae|g__Haloferax|s__Haloferax_volcanii 1
d__Archaea|p__Euryarchaeota|c__Halobacteria|o__Haloferacales|f__Haloferacaceae|g__Haloferax|s__Haloferax_mediterranei 1
d__Archaea|p__Euryarchaeota|c__Halobacteria|o__Natrialbales|f__Natrialbaceae|g__Haloterrigena|s__Haloterrigena_turkmenica 1
d__Archaea|p__Euryarchaeota|c__Halobacteria|o__Natrialbales|f__Natrialbaceae|g__Haloterrigena|s__Haloterrigena_daqingensis 1
Notice how these taxa are reported down to the species ("s__") level.
Run Kraken on your sample:
kraken --gzip-compressed --fastq-input --threads 2 --preload --db ~/databases/minikraken_20171101_8GB_dustmasked mini_your_reads.R1.fastq.gz > mini_your_reads_taxonomic_report.csv
Look at the first 10 lines of your kraken output to ensure it looks correct.
head mini_your_reads_taxonomic_report.csv
Convert your kraken output to .mpa format.
kraken-mpa-report --db ~/databases/minikraken_20171101_8GB_dustmasked mini_your_reads_taxonomic_report.csv > mini_your_reads_final_taxonomic_results.mpa
Look at the first 20 and last 20 lines of your .mpa file.
head -20 mini_your_reads_final_taxonomic_results.mpa
tail -20 mini_your_reads_final_taxonomic_results.mpa
Pull out the species-level assignments in your .mpa file and store them in a .tsv file:
mini_SL342389
Use the grep command to search for species-level taxonomic assignments from kraken and put them in a new file.
grep 's__' mini_SL342389_final_taxonomic_results.mpa > mini_SL342389_species_only.tsv
grep: The command-line utility used to search for specified patterns within a file.
's__': The pattern being searched for within the file.
mini_SL342389_final_taxonomic_results.mpa: The input file being searched.
mini_SL342389_species_only.tsv: The output file where the results of the search are being saved.

Use the ls command to confirm the mini_SL342389_species_only.tsv file was created.
Use the head utility to display the first few lines of the newly created mini_SL342389_species_only.tsv file:
head mini_SL342389_species_only.tsv
Confirm your output looks like this:
d__Viruses|f__Poxviridae|s__BeAn_58058_virus 6
d__Viruses|f__Baculoviridae|g__Alphabaculovirus|s__Lymantria_dispar_multiple_nucleopolyhedrovirus 2
d__Viruses|o__Caudovirales|f__Myoviridae|g__Kp15virus|s__Klebsiella_virus_KP15 1
d__Viruses|o__Caudovirales|f__Myoviridae|g__Secunda5virus|s__Aeromonas_virus_Aes508 1
d__Viruses|o__Caudovirales|f__Siphoviridae|s__Lactococcus_phage_WRP3 1
d__Viruses|o__Caudovirales|f__Siphoviridae|g__Pa6virus|s__Propionibacterium_virus_P104A 1
d__Viruses|s__Apis_mellifera_filamentous_virus 1
d__Viruses|o__Bunyavirales|f__Tospoviridae|g__Orthotospovirus|s__Pepper_chlorotic_spot_virus 1
d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Thiotrichales|f__Piscirickettsiaceae|g__Methylophaga|s__Methylophaga_nitratireducenticrescens 1
d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Pseudomonadales|f__Moraxellaceae|g__Acinetobacter|s__Acinetobacter_lwoffii 60
You want to confirm that each line is taxonomically classified down to the species level by checking that an s__ is listed.
Count the number of lines in the mini_SL342389_species_only.tsv file.
wc -l mini_SL342389_species_only.tsv
915 mini_SL342389_species_only.tsv
This result indicates you counted 915 lines in your mini_SL342389_species_only.tsv file, which means you identified 915 species in Example Sample mini_SL342389 (since each line is one species).
The rationale behind executing these commands is to extract and isolate the taxonomic assignments at the species level from the original file and to obtain a count of the number of species identified in the sample. The grep command filters the original file to only include lines with species-level assignments, and the head command is used to quickly check the output file to ensure that the command has worked as intended. Finally, the wc command is used to obtain a count of the number of lines (i.e., species) in the output file.
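As an aside, grep can count matching lines directly with its -c flag, so the grep-then-wc pair above could be collapsed into a single command that prints just the species count:

grep -c 's__' mini_SL342389_final_taxonomic_results.mpa

For the Example Sample, this should print 915, matching the wc -l result.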
Now repeat these steps with your own assigned sample:
grep 's__' mini_your_reads_final_taxonomic_results.mpa > mini_your_reads_species_only.tsv
head mini_your_reads_species_only.tsv
wc -l mini_your_reads_species_only.tsv
In the Google Sheet, please:
Upload a screenshot of your kraken
command output for the Example Sample. (i.e. the one that states the % sequences classified and unclassified)
Upload a screenshot of your kraken
command output for Your Sample.
Upload a screenshot of your head -20
and tail -20
results for the Example Sample.
Upload a screenshot of your head -20
and tail -20
results for Your Sample.
Answer these questions:
a. What do d__, p__, and the other abbreviations stand for?
b. What do the numbers at the end of each line signify?
c. Why do you think Kraken can't characterize some reads at all?
d. Why do you think Kraken can't characterize some reads down to the species level?
e. When identifying only the species-level taxonomic assignments, what "search term" do you use in your code?
For the next set of steps, we're going to use the R programming language.
Please make sure the RStudio user interface dashboard looks as follows: Console Pane (Lower Left Quadrant), Scripting Pane (Upper Left Quadrant), File Manager/Plots Display/Help-Tab Pane (Lower Right Quadrant), Environment/History-Tab Pane (Upper Right Quadrant).
In R, code is always executed via the Console, but you have the choice whether to execute that code in a Script (opened in the Scripting Pane) or directly in the Console.
Unless you need to quickly execute a one-liner (e.g. setting a working directory using the setwd command), you'll want to write your code in a script, highlight it, and click Run to execute it in the Console. This is because you can easily edit code in a script and re-run it; once code is run in the Console, it is not editable.
Let's complete an R Tutorial, partly based on this website.
Instructions
First, execute the code in the Console. Then clear your Console and Environment Panes using the "brush" icons, re-enter the same code in a script in the Scripting Pane (there is an empty script created for you called "Untitled1"), highlight the code, and click "Run". The results should be the same as when executed in the Console, but you can now edit the script and re-run it as much as you'd like.
R Session
After R is started, there is a console awaiting input. At the prompt (>), you can enter numbers and perform calculations.
Add two numbers:
1 + 2
You should get:
[1] 3
Variable Assignment
We assign values to variables with the assignment operator "=". Typing a variable by itself at the prompt will print out its value. Note that another form of the assignment operator, "<-", is also in use.
Let's try assigning a variable:
x = 1
Now print the value by typing the variable by itself:
x
You should see:
[1] 1
Functions
R functions are invoked by name, followed by parentheses containing zero or more arguments.
Let's try some common functions. mean() calculates the arithmetic mean of a numeric vector, and sum() adds up its elements:
vec <- c(1, 2, 3, 4, 5)
mean(vec) # should return 3
sum(vec) # should return 15
All text after the pound sign "#" within the same line is considered a comment:
1 + 1 # this is a comment
Extension Packages
Sometimes we need additional functionality beyond what the core R library offers. To install an extension package, invoke the install.packages() function at the prompt and follow the instructions.
When you want to use that package, you need to load it in your script using the library() function so R knows to make it available.
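For example, the one-time install of ggplot2 would look like this. (You likely won't need to run it in this VTP, since ggplot2 should already be set up on your instance.)

install.packages("ggplot2")  # downloads and installs the package from CRAN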
Let's try this with the package ggplot2:
library(ggplot2)
Let's make a simple plot using the ggplot() function from the ggplot2 library:
# Load ggplot2 package
library(ggplot2)
# Create a data frame with two variables
data <- data.frame(x = c(1, 2, 3, 4, 5), y = c(2, 4, 5, 4, 6))
# Create a scatter plot using ggplot2
ggplot(data, aes(x = x, y = y)) +
geom_point() # Add points to the plot
In the Google Sheet:
mini_SL342389
#This line reads in the mini_SL342389_species_only.tsv file using the read.csv() function.
taxa_mini_SL342389_species = read.csv('~/reads/mini_SL342389_species_only.tsv', header=FALSE, sep='\t')
#This line orders the data in the taxa_mini_SL342389_species variable based on the values in the second column (V2) in descending order.
taxa_mini_SL342389_species_ordered <- taxa_mini_SL342389_species[order(taxa_mini_SL342389_species$V2, decreasing = TRUE), ]
#This line outputs the first five rows of the ordered data in the taxa_mini_SL342389_species_ordered variable.
head(taxa_mini_SL342389_species_ordered, 5)
The code reads in a file called mini_SL342389_species_only.tsv and stores the data in a variable called taxa_mini_SL342389_species. Then, it orders the data based on the second column (V2) in descending order and stores the new ordered data in a new variable called taxa_mini_SL342389_species_ordered. Finally, the code outputs the top 5 rows of the ordered data.
These five rows are the top 5 species in your dataset.
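As a quick sanity check, you can confirm that the number of rows R read in matches the line count you got from wc -l back in the Linux terminal:

nrow(taxa_mini_SL342389_species)  # should match the wc -l count (915 for the Example Sample)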
Now repeat with your own assigned sample:
taxa_yourSample_species <- read.csv('~/reads/mini_your_reads_species_only.tsv', header=FALSE, sep='\t')
taxa_yourSample_species_ordered <- taxa_yourSample_species[order(taxa_yourSample_species$V2, decreasing = TRUE), ]
head(taxa_yourSample_species_ordered, 5)
In the Google Sheet:
mini_SL342389
Abundance of Phyla
library(ggplot2)
taxa_mini_SL342389 <- read.csv('~/mini_SL342389_taxa.csv', header=TRUE, sep=',')
phyla_mini_SL342389 <- taxa_mini_SL342389[taxa_mini_SL342389$rank == 'phylum',]
phyla_mini_SL342389 <- phyla_mini_SL342389[phyla_mini_SL342389$percent_abundance >= 0.3,]
ggplot(phyla_mini_SL342389 , aes(x=sample, y=percent_abundance, fill=taxon)) +
geom_bar(stat="identity") +
xlab('Sample') +
ylab('Abundance') +
labs(fill='Phylum') +
theme(axis.text.x = element_text(angle = 90))
Abundance of Families
library(ggplot2)
taxa_mini_SL342389 <- read.csv('~/mini_SL342389_taxa.csv', header=TRUE, sep=',')
family_mini_SL342389 <- taxa_mini_SL342389[taxa_mini_SL342389$rank == 'family',]
family_mini_SL342389 <- family_mini_SL342389[family_mini_SL342389$percent_abundance >= 0.3,]
ggplot(family_mini_SL342389 , aes(x=sample, y=percent_abundance, fill=taxon)) +
geom_bar(stat="identity") +
xlab('Sample') +
ylab('Abundance') +
labs(fill='Family') +
theme(axis.text.x = element_text(angle = 90))
Abundance of Phyla
library(ggplot2)
taxa_yourSample <- read.csv('~/yourSample_taxa.csv', header=TRUE, sep=',')
phylum_yourSample <- taxa_yourSample[taxa_yourSample$rank == 'phylum',]
phylum_yourSample <- phylum_yourSample[phylum_yourSample$percent_abundance >= 0.3,]
ggplot(phylum_yourSample, aes(x=sample, y=percent_abundance, fill=taxon)) +
geom_bar(stat="identity") +
xlab('Sample') +
ylab('Abundance') +
labs(fill='Phylum') +
theme(axis.text.x = element_text(angle = 90))
Abundance of Families
library(ggplot2)
taxa_yourSample <- read.csv('~/yourSample_taxa.csv', header=TRUE, sep=',')
family_yourSample <- taxa_yourSample[taxa_yourSample$rank == 'family',]
family_yourSample <- family_yourSample[family_yourSample$percent_abundance >= 0.3,]
ggplot(family_yourSample, aes(x=sample, y=percent_abundance, fill=taxon)) +
geom_bar(stat="identity") +
xlab('Sample') +
ylab('Abundance') +
labs(fill='Family') +
theme(axis.text.x = element_text(angle = 90))
Barplot: Abundance of top phyla across all samples
library(ggplot2)
taxa_all_barplot <- read.csv('~/kraken_all_metasub.csv', header=TRUE, sep=',')
taxa_all_barplot <- taxa_all_barplot[taxa_all_barplot$rank == 'phylum',]
taxa_all_barplot <- taxa_all_barplot[taxa_all_barplot$percent_abundance >= 0.3,]
ggplot(taxa_all_barplot, aes(x=sample, y=percent_abundance, fill=taxon)) +
geom_bar(stat="identity") +
xlab('Sample') +
ylab('Abundance') +
labs(fill='Phylum') +
theme(axis.text.x = element_text(angle = 90))
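If you'd rather save a plot to a file than take a screenshot, ggplot2's ggsave() function writes the most recently displayed plot to disk. The filename below is just an example; width and height are in inches:

ggsave('~/top_phyla_all_samples.png', width = 10, height = 6)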
In the Google Sheet:
Identify resistance genes in your sample’s most abundant species
Navigate to the Bacterial and Viral Bioinformatics Resource Center (BV-BRC)
Use the tools in BV-BRC to ID at least one AMR locus/gene for the top 5 most abundant species in your sample.
Here’s a screencast using the species Staphylococcus aureus (Methicillin-resistant Staphylococcus aureus (MRSA) is a widely known antibiotic resistant hospital acquired infection):
In the Google Sheet:
List three locus IDs of efflux pumps conferring antibiotic resistance for the species you queried using the BV-BRC. (Only report the Locus ID, which is denoted by a > symbol. Do not report the locus sequence.)
Explain how you would determine whether a resistant strain of one of the abundant species in your sample is present.
So, at this point, you've made it through a bioinformatic analysis of a metagenomic dataset. Well done!
At this point, we want to show you how what you did compares to what a professional computational biologist would do. Because we had limited time and so much material to get through, we simplified the analysis you completed by removing some steps:
Normally, professional computational biologists would take the raw data through a quality control process, where they review the quality scores of the nucleotides, trim poor-quality nucleotides (if only a few are bad), and remove a whole read from the dataset if poor-quality nucleotides make up a large portion of it.
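For reference only (the tool is not installed for this VTP), a quality-control step with a popular trimmer like fastp might look something like this sketch, with hypothetical file names:

fastp -i raw_reads.R1.fastq.gz -o qc_reads.R1.fastq.gz  # trim low-quality bases, drop reads that are mostly poor quality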
In metagenomic analyses, there are often DNA fragments from species we don't care about, and these can complicate the analysis. For example, if you collect DNA from surfaces that humans touch, as is the case in transit systems, you tend to get a lot of contaminating human reads in your data. To remove them, we often use an algorithm called an aligner, which tries to match each read to a reference genome of the species we want to remove. If a read matches, it gets removed so it doesn't undermine downstream analysis.
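Again for reference only, removing human reads with an aligner such as bowtie2 might look something like this sketch (hypothetical index and file names; bowtie2 is not installed for this VTP):

# Align reads against a human reference index; keep only the reads that do NOT align
# (the non-human reads) by writing them to a new compressed fastq file.
bowtie2 -x human_reference_index -U qc_reads.R1.fastq.gz --un-gz nonhuman_reads.R1.fastq.gz > /dev/null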
There are many more detailed analyses a professional could, and likely would, perform downstream. For example, a scientist might want to know whether a bacterial species in a sample is a resistant strain.
Here's a flowchart with some of the additional steps:
Please complete the Post-VTP Survey.