This guide is intended to teach you one component of metagenomic analysis: how to identify (ID) and quantify the microbial species in a sequenced metagenomic sample. It assumes you have a sample available and that you have a machine with 64GB of available RAM.

Overall Approach:

Part 1 of this project is completed in the Linux terminal (within RStudio) and has 5 steps:

  1. Log in to your RStudio instance and access the “Terminal” tab
  2. Look at your assigned data (mini dataset)
  3. Remove adapters and filter out low-quality reads (mini dataset)
  4. Remove contaminating human DNA from your sample (mini dataset)
  5. Execute Metagenomic Analysis (mini dataset)

Part 2 of this project is completed using the R programming language in the RStudio scripting pane and console. It also has 5 steps:

  1. Log in to RStudio instance (URL/username in the Google Sheet)
  2. Generate barplot using subset of kraken samples
  3. Generate PCA plot using subset of kraken samples
  4. Generate barplot using all kraken samples that you and your classmates analyzed
  5. Sort species abundance from highest to lowest, ID AMR Loci

Part 1

(1) Log in to your RStudio instance and access the “Terminal” tab. (time estimate: 2 min)

Log in to the same RStudio instance you’ve been completing your Pre-VTP exercises in.

Once you’ve logged in, select the “Terminal” tab.

By default, you should be in your home (i.e. ~) directory.
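
You can confirm this at any time by printing your current working directory:

$ pwd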

Now execute:

$ ls

At a minimum, you should see these assets in your home directory:

R databases miniconda3 pass.txt reads rstudio taxa_table.csv

(Your home directory may have additional assets because you created them while completing the Pre-VTP exercises. This is completely fine.)

(2) Look at your assigned dataset (time estimate: 5-10 min)

To start, you’re going to complete the analysis with a “mini” version of your assigned dataset. This will enable you to run through the analysis pipeline more quickly (without waiting a long time at each step).

We’ve taken 50K reads from the read1 and read2 files of your assigned sample. They are labeled with the prefix mini_ and located in the reads directory.

Navigate to the reads directory (which contains the “mini” fastq data):

$ cd ~/reads/

Note the composition of the reads directory:

$ ls

PLEASE NOTE: make sure to complete all steps in this section from the reads directory.

Find your “Assigned Sample ID” (Column N) in the Google Sheet: https://docs.google.com/spreadsheets/d/1irV3-fyiqvYbHnfKveNu5IinK7Mqoo8tN4jr1zyLPgM/edit?usp=sharing

Take a look at the first 10 lines of your assigned fastq files:

$ zcat mini_my_reads.R1.fastq.gz | head

$ zcat mini_my_reads.R2.fastq.gz | head

(Please note: mini_my_reads is a placeholder for your assigned miniature reads dataset.)

You can read a little about FASTQ file format here: https://en.wikipedia.org/wiki/FASTQ_format
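
For orientation: each read in a fastq file occupies four lines: a header beginning with @, the DNA sequence, a separator line beginning with +, and a quality string with one character per base. A made-up record (the read name and bases here are placeholders) looks like this:

@read1/1
GATTTGGGGTTCAAAGCAGT
+
IIIIIIIIIIIIIIIIIIII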

Checkpoint (CP) 1. In a single message in your group Slack channel, please post screenshots of your zcat results and answer these questions: 1. How are fastq files structured? 2. Why are there two read files for each sample?

(3) Remove adapters and filter out low-quality reads (time estimate: 5-10 min)

Check that AdapterRemoval is working by opening your terminal and typing:

$ AdapterRemoval -h

You should see a long list of all the arguments that can be given to this tool.

Run AdapterRemoval on your sample (explanatory version):

$ AdapterRemoval
    --file1 mini_my_reads.R1.fastq.gz --file2 mini_my_reads.R2.fastq.gz  # this tells the program where to find our input data.
    --trimns  # this tells the program to trim ambiguous bases ('N's) from the ends of reads
    --trimqualities --minquality 20  # this tells the program to trim low-quality bases from the ends of reads, using 20 as the quality threshold
    --threads 2  # this tells the computer to multithread the program with 2 cores
    --gzip  # this tells the program that both the input and output files are compressed with a program called gzip
    --output1 clean.mini_my_reads.R1.fastq.gz  --output2 clean.mini_my_reads.R2.fastq.gz  # this tells the program where to put its output

Run AdapterRemoval on your mini sample (executable version):

$ AdapterRemoval --file1 mini_my_reads.R1.fastq.gz --file2 mini_my_reads.R2.fastq.gz --trimns --trimqualities --minquality 20 --threads 2 --gzip --output1 clean.mini_my_reads.R1.fastq.gz  --output2 clean.mini_my_reads.R2.fastq.gz

Please note: my_reads is a placeholder for your assigned sample ID.

Confirm that the files were generated by running ls and locating them:

$ ls

Now we have cleaned read files. Look at the output files by running:

$ zcat clean.mini_my_reads.R1.fastq.gz | head

$ zcat clean.mini_my_reads.R2.fastq.gz | head
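
AdapterRemoval also writes a plain-text summary of the run to a “.settings” file (named via the --basename option; since we didn’t set one, it is usually called your_output.settings). If you see such a file in your ls output, take a look:

$ cat your_output.settings  # the exact filename is an assumption; use whatever .settings file ls shows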

(4) Remove contaminating human DNA from your sample (time estimate: 5-10 min)

Run this command to use bowtie2 to align and remove human reads from your assigned sample (explanatory version):

$ bowtie2
    -p 2  # this tells the computer to multithread the program with 2 cores
    --sensitive  # this tells bowtie2 how hard to look for a match
    --un-conc-gz nonhuman.mini_my_reads.R%.fastq.gz  # this tells the program where to put read pairs that do not map to the human genome (bowtie2 replaces the % with 1 and 2)
    -x ~/databases/hg38.bt2  # this is our copy of the human reference genome, specially indexed for this tool
    -1 clean.mini_my_reads.R1.fastq.gz -2 clean.mini_my_reads.R2.fastq.gz  # the reads that we cleaned with AdapterRemoval
    > /dev/null  # discard the SAM alignment output, which we don’t need here

Run this command to use bowtie2 to align and remove human reads from your assigned sample (executable version):

$ bowtie2 -p 2 --sensitive --un-conc-gz nonhuman.mini_my_reads.R%.fastq.gz -x ~/databases/hg38.bt2 -1 clean.mini_my_reads.R1.fastq.gz -2 clean.mini_my_reads.R2.fastq.gz > /dev/null  

We can count the number of reads in a file by counting its lines and dividing by four. Run each of the following commands and divide each result by four:

$ zcat mini_my_reads.R1.fastq.gz | wc -l

$ zcat mini_my_reads.R2.fastq.gz | wc -l

$ zcat clean.mini_my_reads.R1.fastq.gz | wc -l

$ zcat clean.mini_my_reads.R2.fastq.gz | wc -l

$ zcat nonhuman.mini_my_reads.R1.fastq.gz | wc -l

$ zcat nonhuman.mini_my_reads.R2.fastq.gz | wc -l
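
If you’d rather have the shell do the division for you, this one-liner (a minimal sketch using bash arithmetic) prints the read count directly; substitute any of the files above:

$ echo $(( $(zcat mini_my_reads.R1.fastq.gz | wc -l) / 4 ))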

CP 2. Please post screenshots of your results in your group Slack channel and answer these questions: 1. Why do you think bowtie2 uses an indexed reference genome? 2. Report the number of reads in your raw read fastqs, your adapter-removed/filtered fastqs, and your human-removed fastqs. 3. Explain why it’s necessary to divide by four. If you’re not sure, look at the first few reads of one of your read files to try and figure this out.

(5) Execute Metagenomic Analysis (time estimate: 5-10 min)

Install Kraken with conda (explanatory version):

$ conda  # the name of the package manager
    install  # this is a subcommand; conda has several other functions
    --channel bioconda  # the channel (software repository) that hosts kraken
    --yes  # give conda permission to install software without prompting
    kraken  # the name of the program we are installing

Install Kraken with conda (executable version):

$ conda install --channel bioconda --yes kraken 

You should now be able to test Kraken by running:

$ kraken --help

If everything is working, we can now perform taxonomic classification on our data using this command (explanatory version):

$ kraken
    --gzip-compressed  # tells kraken that our reads are compressed
    --fastq-input  # tells kraken that our reads are formatted as fastq files
    --threads 2  # uses 2 threads
    --paired  # tells kraken that we have paired end data
    --preload  # loads the database into RAM before running
    --db ~/databases/minikraken_20171101_8GB_dustmasked  # tells kraken where to find the database
    nonhuman.mini_my_reads.R1.fastq.gz  # filepath to the Read1 file
    nonhuman.mini_my_reads.R2.fastq.gz  # filepath to the Read2 file
    > mysampleID_mini_taxonomic_report.csv  # redirect kraken's per-read output into a file

If everything is working, we can now perform taxonomic classification on our data using this command (executable version):

$ kraken --gzip-compressed --fastq-input --threads 2 --paired --preload --db ~/databases/minikraken_20171101_8GB_dustmasked nonhuman.mini_my_reads.R1.fastq.gz nonhuman.mini_my_reads.R2.fastq.gz > mysampleID_mini_taxonomic_report.csv

BE SURE to replace mysampleID with your assigned sample ID (and, as before, my_reads with your assigned sample's reads prefix).

Once this command runs, take a look at the output by running:

$ head mysampleID_mini_taxonomic_report.csv

(the head command prints the first 10 lines of any file to the terminal)

You should see the names of some microbes along with summary statistics about their presence in our sample. As is, this format is not very useful, so we will convert it into a better format using a tool provided by Kraken.

Run:

$ kraken-mpa-report --db ~/databases/minikraken_20171101_8GB_dustmasked mysampleID_mini_taxonomic_report.csv > mysampleID_mini_final_taxonomic_results.mpa
$ head -20 mysampleID_mini_final_taxonomic_results.mpa

Look at the last 20 lines by executing:

$ tail -20 mysampleID_mini_final_taxonomic_results.mpa

Extract just the species-level lines into a new file:

$ grep 's__' mysampleID_mini_final_taxonomic_results.mpa > mysample_mini_species_only.txt

See how many species were identified:

$ wc -l mysample_mini_species_only.txt

Look at the first 10 lines:

$ head mysample_mini_species_only.txt
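
If you want a quick look at your most abundant species, you can sort this file numerically. The sketch below assumes the mpa output is tab-separated with the read count in the second column (check your own file first):

$ sort -t$'\t' -k2,2nr mysample_mini_species_only.txt | head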

CP 3. Please post screenshots of your results and answer these questions about Kraken: 1. What do d, p, and other abbreviations stand for? 2. What do you think the numbers at the end of each line signify? 3. Why do you think Kraken can’t characterize some reads at all? Why do you think Kraken can’t characterize some reads down to the species level? 4. When identifying only the species-level taxonomic assignments, what “search term” is best to use? 5. How many species did you ID in your sample?

Part 2

(6) Log in to RStudio instance (URL/username in the Google Sheet)

Log in to your assigned RStudio instance.

Up until this point, you have been executing tasks in the Linux Terminal feature of RStudio, but this is actually not the primary feature of RStudio. Rather, RStudio is typically used to execute R programming language code. R is typically used for downstream statistical analysis and data visualization, which is what we’ll be using it for here.

RStudio is an integrated development environment (IDE) for the R programming language, which basically means it brings the core components of R (Scripting Pane, Console, Environment Pane, File Manager/Plots/Help Pane) into a quadrant-based user interface for efficient use.

Notice we’ve changed the look of your dashboard to include an additional quadrant in the upper left corner of the screen. The quadrant structure of the RStudio user interface is now: Console Pane (Lower Left Quadrant), Scripting Pane (Upper Left Quadrant), File Manager/Plots Display/Help-Tab Pane (Lower Right Quadrant), Environment/History-Tab Pane (Upper Right Quadrant).

In R, code can always be executed via the Console, but you also have the choice of executing that code from a script (opened in the Scripting Pane). Unless you need to quickly execute a one-liner (e.g. setting a working directory using the setwd command), you’ll want to write your code in a script, highlight it, and click Run to execute it in the Console. This is because you can easily edit code in a script and re-run it; once code has been run in the Console, it is not editable.
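
For example, a working-directory change is a typical one-liner for the Console. Your files from Part 1 live in your home directory, so pointing R there (assuming taxa_table.csv is still in your home folder) lets the scripts below find their input files:

setwd("~")  # set R's working directory to your home folder, where taxa_table.csv is located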

(7) Generate barplot using subset of kraken samples

In Part 1, you learned how to convert a set of raw DNA sequences into high-quality taxonomic profiles of your metagenomic samples on the command line. While generating taxonomic profiles is an important step, the raw profiles can’t be easily interpreted by human beings.

In this section, you will learn how to use R to generate two plots that help visualize the similarities and differences between a subset of the Metasub samples and metagenomic samples from the Human Microbiome Project.

We have provided two kraken files for your convenience: one with a subset of the kraken data (taxa_table.csv) and one with all of the data (kraken_all_metasub.csv).

First we will plot our samples as a stacked barplot. A stacked barplot shows a set of numbers as a series of segments stacked one on top of another and colored by label. In our case, each taxonomic profile is a set of numerical abundances labeled by the microbial taxon each belongs to.

Here’s how we make this plot in R:

library(ggplot2)   # This line loads an additional library into our environment

taxa = read.csv('taxa_table.csv', header=TRUE, sep=',')  # Read our taxonomic table into a computational object from a file

taxa = taxa[taxa$rank == 'phylum',]  # This filters our taxonomic table to a specific taxonomic rank: one of kingdom, phylum, class, order, family, genus, or species (as written in the rank column). Play around with a few other ranks.

ggplot(taxa, aes(x=sample, y=percent_abundance, fill=taxon)) + # this creates a ggplot object. Can you figure out what the aes(...) section is doing?
  geom_bar(stat="identity") +  # this tells ggplot how we want our data to be displayed
  xlab('Sample') +  # These lines tell ggplot what our axis labels should be
  ylab('Abundance') +
  labs(fill='Phylum') +
  theme(axis.text.x = element_text(angle = 90)) # this rotates the x-axis text 90 degrees

Here is a video of this step being performed: https://youtu.be/WABocs_Gu4Y.

Take a look at this plot. In particular, notice how difficult it is to see small, low-abundance groups. The height of each group is proportional to its percent abundance in the sample. However, we often want to make small groups more visible than they are by default. Can you figure out how to modify the y-axis scaling of this plot to show small groups more clearly?

(8) Generate PCA plot using subset of kraken samples

As humans, it’s easiest for us to see plots in two dimensions, but microbiome data is not inherently two-dimensional. In a formal mathematical sense, microbiome data has a dimensionality equal to the number of microbes that can be identified – often thousands. However, most of these dimensions are highly correlated, meaning (roughly) that most of the information about the abundance of one particular microbe can probably be well represented by the abundance of a different microbe and a constant scaling factor.

In fact, it is possible to find sets of scaling factors that represent as much information as possible about a set of samples using a technique called Principal Component Analysis (or PCA for short). PCA is mathematically complex and we won’t explain it here, but we will do an exercise to show how it can be used in R to plot your data in 2D.

library(ggplot2)
library(reshape2)

taxa = read.csv('taxa_table.csv', header=TRUE, sep=',')
taxa = taxa[taxa$rank == 'species',]
taxa = acast(taxa, sample~taxon)  # add some print statements and see if you can understand what this line does

pca = prcomp(taxa)  # run PCA on our samples. look at the pca object and get a sense of its features

principal_components = pca$x[,1:2]  # grab the first two most important components from PCA. Play with some other sets.

principal_components = data.frame(principal_components)

p = ggplot(principal_components, aes(x=PC1, y=PC2)) +
    geom_point(size=3)

print(p)

Note how many clusters of points you see. Add color to this plot to indicate which samples are from the environment (i.e. Metasub’s subway samples), the human gut, and the “mystery” samples.

Execute the PCA Plot with color:

library(ggplot2)
library(reshape2)

taxa = read.csv('taxa_table.csv', header=TRUE, sep=',')
taxa = taxa[taxa$rank == 'species',]
taxa = acast(taxa, sample~taxon)  # add some print statements and see if you can understand what this line does

pca = prcomp(taxa)  # run PCA on our samples. look at the pca object and get a sense of its features

principal_components = pca$x[,1:2]  # grab the first two most important components from PCA. Play with some other sets.

principal_components = data.frame(principal_components)

print(principal_components)

principal_components$sample_type = unlist(
  lapply(rownames(principal_components), FUN=function(x){strsplit(x, '\\.')[[1]][1]})
)

principal_components$sample_type = unlist(lapply(principal_components$sample_type, FUN=function(x){
  if(x == 'gastrointestinal' || x == 'mystery'){
    return(x)
  }
  return('environmental')
}))

p = ggplot(principal_components, aes(x=PC1, y=PC2, color=sample_type)) +
  geom_point(size=3) +
  labs(color='Samples')  # the legend comes from the color aesthetic, so it is labeled via 'color', not 'fill'

print(p)

Here is a video of this step being performed: https://youtu.be/3H2wGMWdflY

(9) Generate barplot using all kraken samples that you and your classmates analyzed

Now create a barplot with all of the Metasub samples:

library(ggplot2)  

taxa_all_barplot = read.csv('kraken_all_metasub.csv', header=TRUE, sep=',') 

taxa_all_barplot = taxa_all_barplot[taxa_all_barplot$rank == 'phylum',] 

ggplot(taxa_all_barplot, aes(x=sample, y=percent_abundance, fill=taxon)) +
  geom_bar(stat="identity") +
  xlab('Sample') + 
  ylab('Abundance') +
  labs(fill='Phylum') +
  theme(axis.text.x = element_text(angle = 90)) 

CP 4: Please post your results through this step. Briefly compare the Metasub, HMP gastrointestinal, and mystery sample compositions to each other. Submit your barplots and PCA plot and tell us your best hypothesis as to the source of the mystery samples.

(10) Sort species abundance from highest to lowest, ID AMR Loci

Read-in all of the Kraken data that you and your classmates generated:

taxa_all = read.csv('kraken_all_metasub.csv', header=TRUE, sep=',')

Click on the object in the Environment pane and sort it by sample name.
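
If you prefer to sort in code rather than in the viewer, here is a minimal sketch; it assumes kraken_all_metasub.csv has the same columns as taxa_table.csv (sample, rank, taxon, percent_abundance):

taxa_species = taxa_all[taxa_all$rank == 'species',]  # keep only the species-level rows
taxa_sorted = taxa_species[order(-taxa_species$percent_abundance),]  # sort from highest to lowest abundance
head(taxa_sorted, 10)  # show the ten most abundant entries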

Locate 3-5 high-abundance species in your sample and use the Metasub dashboard to detect whether there are strains of these spp. that have evolved resistance to antibiotics:

http://metasub.org/map/#

Navigate to https://www.patricbrc.org/. Use the tools in PATRIC to ID at least one AMR locus/gene for each of the spp. you search.

Here’s a screencast showing one example of how to do this with Staphylococcus aureus: https://youtu.be/dMK2abwzoZE.

CP 5: Post your results from the PATRIC database to Slack.