This guide teaches one component of metagenomic analysis: how to identify and quantify the microbial species in a sequenced metagenomic sample. It assumes you have a sample available and a machine with 64GB of available RAM.
Overall Approach:
Part 1 of this project is completed in the Linux terminal (within RStudio) and has 5 steps.
Part 2 of this project is completed using the R programming language in the RStudio scripting pane and console. It also has 5 steps.
Log in to the same RStudio instance you’ve been completing your Pre-VTP exercises in.
Once you’ve logged in, select the “Terminal” tab.
By default, you should be in your home (i.e. ~) directory.
Now execute:
$ ls
At a minimum, you should see these assets in your home directory:
R
databases
miniconda3
pass.txt
reads
rstudio
taxa_table.csv
(Your home directory may have additional assets because you created them while completing the Pre-VTP exercises. This is completely fine.)
To start, you’re going to complete the analysis with a “mini” version of your assigned dataset. This will enable you to run through the analysis pipeline more quickly (without waiting a long time at each step).
We’ve taken 50K reads from the read1 and read2 files of your assigned sample. They are labeled with the prefix mini_ and located in the reads directory.
Navigate to the reads directory (which contains the “mini” fastq data):
$ cd ~/reads/
Note the composition of the reads directory:
$ ls
PLEASE NOTE: make sure to complete all steps in this section from the reads directory.
Find your “Assigned Sample ID” (Column N) in the Google Sheet: https://docs.google.com/spreadsheets/d/1irV3-fyiqvYbHnfKveNu5IinK7Mqoo8tN4jr1zyLPgM/edit?usp=sharing
Take a look at the first 10 lines of your assigned fastq files:
$ zcat mini_my_reads.R1.fastq.gz | head
$ zcat mini_my_reads.R2.fastq.gz | head
(Please note: mini_my_reads is a placeholder for your assigned miniature reads dataset.)
You can read a little about FASTQ file format here: https://en.wikipedia.org/wiki/FASTQ_format
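Concretely, each record in a fastq file spans exactly four lines: a header beginning with @, the raw sequence, a separator line beginning with +, and a quality string with one character per base. Here is a hypothetical record (illustrative only, not from your data):
@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65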
Checkpoint (CP) 1.
In a single message in your group Slack channel, please post screenshots of your zcat results and answer these questions: 1. How are fastq files structured? 2. Why are there two reads files for each sample?
Check that AdapterRemoval is working by opening your terminal and typing:
$ AdapterRemoval -h
You should see a long list of all the arguments that can be given to this tool.
Run AdapterRemoval on your mini sample (explanatory version):
$ AdapterRemoval
--file1 mini_my_reads.R1.fastq.gz --file2 mini_my_reads.R2.fastq.gz # this tells the program where to find our input data.
--trimns # this tells the program to trim ambiguous bases ('N's) from the ends of reads
--trimqualities --minquality 20 # this tells the program to trim low-quality bases (Phred score below 20) from the ends of reads
--threads 2 # this tells the computer to multithread the program with 2 cores
--gzip # this tells the program that both the input and output files are compressed with a program called gzip
--output1 clean.mini_my_reads.R1.fastq.gz --output2 clean.mini_my_reads.R2.fastq.gz # this tells the program where to put its output
Run AdapterRemoval on your mini sample (executable version):
$ AdapterRemoval --file1 mini_my_reads.R1.fastq.gz --file2 mini_my_reads.R2.fastq.gz --trimns --trimqualities --minquality 20 --threads 2 --gzip --output1 clean.mini_my_reads.R1.fastq.gz --output2 clean.mini_my_reads.R2.fastq.gz
Please note: my_reads is a placeholder for your assigned sample ID.
Confirm that the files were generated by running ls and locating them:
$ ls
Now we have clean read files. Look at the output files by running:
$ zcat clean.mini_my_reads.R1.fastq.gz | head
$ zcat clean.mini_my_reads.R2.fastq.gz | head
Run this command to use bowtie2 to align and remove human reads from your assigned sample (explanatory version):
$ bowtie2
-p 2 # this tells the computer to multithread the program with 2 cores
--sensitive # this tells bowtie2 how hard to look for a match
--un-conc-gz nonhuman.mini_my_reads.R%.fastq.gz # this tells the program where to put reads that do not map to the human genome
-x ~/databases/hg38.bt2 # this is our copy of the human reference genome, specially formatted for this tool
-1 clean.mini_my_reads.R1.fastq.gz -2 clean.mini_my_reads.R2.fastq.gz # our reads that we cleaned with AdapterRemoval
> /dev/null # discard the alignment output (SAM format), which we don't need
Run this command to use bowtie2 to align and remove human reads from your assigned sample (executable version):
$ bowtie2 -p 2 --sensitive --un-conc-gz nonhuman.mini_my_reads.R%.fastq.gz -x ~/databases/hg38.bt2 -1 clean.mini_my_reads.R1.fastq.gz -2 clean.mini_my_reads.R2.fastq.gz > /dev/null
We can count the number of reads in a file by running zcat on it, piping the result to wc -l, and dividing the line count by four. Do this for each of your six files:
$ zcat mini_my_reads.R1.fastq.gz | wc -l
$ zcat mini_my_reads.R2.fastq.gz | wc -l
$ zcat clean.mini_my_reads.R1.fastq.gz | wc -l
$ zcat clean.mini_my_reads.R2.fastq.gz | wc -l
$ zcat nonhuman.mini_my_reads.R1.fastq.gz | wc -l
$ zcat nonhuman.mini_my_reads.R2.fastq.gz | wc -l
Remember to divide each line count by four to get the number of reads.
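If you’d rather let the shell do the division, you can wrap any of the commands above in bash arithmetic expansion (shown here for the raw read1 file; substitute each filename in turn):
$ echo $(( $(zcat mini_my_reads.R1.fastq.gz | wc -l) / 4 ))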
CP 2.
Please post screenshots of your results in your group Slack channel and answer these questions: 1. Why do you think bowtie2 uses an indexed reference genome? 2. Report the number of reads in your raw reads fastqs, your adapter-removed/filtered fastqs, and your human-removed fastqs. 3. Explain why it’s necessary to divide by four. If you’re not sure, look at the first few reads of one of your reads files to try and figure this out.
Install Kraken (explanatory version):
$ conda # the name of the package manager
install # this is a subcommand; conda has several other subcommands
--channel bioconda # this tells conda which channel (software repository) to search
--yes # give conda permission to install software
kraken # the name of the program we are installing
Install Kraken (executable version):
$ conda install --channel bioconda --yes kraken
You should now be able to test Kraken by running:
$ kraken --help
If everything is working, we can now perform taxonomic classification on our data using this command (explanatory version):
$ kraken
--gzip-compressed # tells kraken that our reads are compressed
--fastq-input # tells kraken that our reads are formatted as fastq files
--threads 2 # uses 2 threads
--paired # tells kraken that we have paired end data
--preload # loads the database into RAM before running
--db ~/databases/minikraken_20171101_8GB_dustmasked # tells kraken where to find the database
nonhuman.mini_my_reads.R1.fastq.gz # Filepath to Read1 file
nonhuman.mini_my_reads.R2.fastq.gz # Filepath to Read2 file
> mysampleID_mini_taxonomic_report.csv # this redirects kraken's output into a file
If everything is working, we can now perform taxonomic classification on our data using this command (executable version):
$ kraken --gzip-compressed --fastq-input --threads 2 --paired --preload --db ~/databases/minikraken_20171101_8GB_dustmasked nonhuman.mini_my_reads.R1.fastq.gz nonhuman.mini_my_reads.R2.fastq.gz > mysampleID_mini_taxonomic_report.csv
BE SURE to replace mysampleID with your assigned sample ID.
Once this command runs, take a look at the output by running:
$ head mysampleID_mini_taxonomic_report.csv
(the head command prints the first 10 lines of any file to the terminal)
You should see some names of microbes along with some summary statistics about their presence in our sample. As is, this format is not very useful, so we will convert it into a better format using a tool provided by Kraken.
Run:
$ kraken-mpa-report --db ~/databases/minikraken_20171101_8GB_dustmasked mysampleID_mini_taxonomic_report.csv > mysampleID_mini_final_taxonomic_results.mpa
Look at the first 20 lines by executing:
$ head -20 mysampleID_mini_final_taxonomic_results.mpa
Look at the last 20 lines by executing:
$ tail -20 mysampleID_mini_final_taxonomic_results.mpa
Extract only the species-level assignments:
$ grep 's__' mysampleID_mini_final_taxonomic_results.mpa > mysampleID_mini_species_only.txt
See how many species were identified:
$ wc -l mysampleID_mini_species_only.txt
Look at the first 10 lines:
$ head mysampleID_mini_species_only.txt
CP 3.
Please post screenshots of your results and answer these questions about Kraken: 1. What do d, p, and other abbreviations stand for? 2. What do you think the numbers at the end of each line signify? 3. Why do you think Kraken can’t characterize some reads at all? Why do you think Kraken can’t characterize some reads down to the species level? 4. When identifying only the species-level taxonomic assignments, what “search term” is best to use? 5. How many species did you ID in your sample?
Log in to your assigned RStudio instance.
Up until this point, you have been executing tasks in the Linux Terminal feature of RStudio, but this is actually not the primary feature of RStudio. Rather, RStudio is typically used to execute R programming language code. R is typically used for downstream statistical analysis and data visualization, which is what we’ll be using it for here.
RStudio is an integrated development environment (IDE) for the R programming language, which basically means it brings the core tools for working with R (Scripting Pane, Console, Environment Pane, File Manager/Plots/Help Pane) into a quadrant-based user interface for efficient use.
Notice we’ve changed the look of your dashboard to have an additional quadrant in the upper left corner of the screen. The quadrant structure of the RStudio user interface is now: Console Pane (Lower Left Quadrant), Scripting Pane (Upper Left Quadrant), File Manager/Plots Display/Help-Tab Pane (Lower Right Quadrant), Environment/History-Tab Pane (Upper Right Quadrant).
In R, code can always be executed via the Console, but you also have the choice to execute that code from a Script (opened in the Scripting Pane). Unless you need to quickly execute a one-liner (e.g. setting a working directory using the setwd command), you’ll want to write your code in a script, highlight it, and click Run to execute it in the console. This is because you can easily edit code in a script and re-run it; once code is run in the Console, it is not editable.
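For example, setting the working directory is a typical one-liner you might type straight into the Console rather than saving in a script (the path here is just an illustration; taxa_table.csv lives in your home directory):
setwd("~") # set the working directory to your home folder
getwd()    # print the current working directory to confirm the change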
In Part 1, you learned how to convert a set of raw DNA sequences into high quality taxonomic profiles of your metagenomic samples on the command line. While taxonomic profiles are an important step, they can’t be easily interpreted by human beings.
In this section, you will learn how to use R to generate two plots that help visualize the similarities and differences between a subset of the MetaSUB samples and metagenomic samples from the Human Microbiome Project.
We have provided two kraken files for your convenience: one with a subset of the kraken data (taxa_table.csv) and one with all of the data (kraken_all_metasub.csv).
First we will plot our samples as a stacked barplot. A stacked barplot shows a set of numbers as a series of column segments stacked on top of one another, colored by a label. In our case, a taxonomic profile is a set of numerical abundances, each labeled by the microbial species it belongs to.
Here’s how we make this plot in R:
library(ggplot2) # this line loads an additional library into our environment
taxa = read.csv('taxa_table.csv', header=TRUE, sep=',') # Read our taxonomic table into a computational object from a file
taxa = taxa[taxa$rank == 'phylum',] # This filters our taxonomic table to a specific taxonomic rank. One of Kingdom, Phylum, Class, Order, Family, Genus, Species. Play around with a few other ranks.
ggplot(taxa, aes(x=sample, y=percent_abundance, fill=taxon)) + # this creates a ggplot object. Can you figure out what the aes(...) section is doing?
geom_bar(stat="identity") + # this tells ggplot how we want our data to be displayed
xlab('Sample') + # These lines tell ggplot what our axis labels should be
ylab('Abundance') +
labs(fill='Phylum') +
theme(axis.text.x = element_text(angle = 90)) # this rotates the x-axis text 90 degrees
Here is a video of this step being performed: https://youtu.be/WABocs_Gu4Y.
Take a look at this plot. In particular, notice how it is difficult to see small low abundance groups. The height of each group is proportional to its percent abundance in the sample. However, we often want to make small groups more visible than they are by default. Can you figure out how to modify the y axis scaling of this plot to show small groups more clearly?
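If you get stuck, here is one possible approach (a sketch, not the only answer): apply a square-root transformation to the y axis so that large values are compressed and small groups become easier to see.
library(ggplot2)
taxa = read.csv('taxa_table.csv', header=TRUE, sep=',')
taxa = taxa[taxa$rank == 'phylum',]
ggplot(taxa, aes(x=sample, y=percent_abundance, fill=taxon)) +
geom_bar(stat="identity") +
scale_y_sqrt() + # transform the y axis; experiment with other scales too
xlab('Sample') +
ylab('Abundance') +
labs(fill='Phylum') +
theme(axis.text.x = element_text(angle = 90))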
As humans, it’s easiest for us to see plots in two dimensions, but microbiome data is not inherently two dimensional. In a formal mathematical sense, microbiome data has a dimensionality equal to the number of microbes which can be identified – often thousands. However, most of these dimensions are highly correlated meaning, roughly, that most of the information about the abundance of some particular microbe can probably be well represented by using the abundance of a different microbe and a constant scaling factor.
In fact, it is possible to find sets of scaling factors that (likely) represent as much information as possible about a set of samples using a technique called Principal Component Analysis (or PCA for short). PCA is mathematically complex and we won’t explain it here but we will do an exercise to show how it can be used in R to plot your data in 2D.
library(ggplot2)
library(reshape2)
taxa = read.csv('taxa_table.csv', header=TRUE, sep=',')
taxa = taxa[taxa$rank == 'species',]
taxa = acast(taxa, sample~taxon, value.var='percent_abundance', fill=0) # reshape into a sample-by-taxon matrix, filling missing combinations with 0. Add some print statements and see if you can understand what this line does
pca = prcomp(taxa) # run PCA on our samples. look at the pca object and get a sense of its features
principal_components = pca$x[,1:2] # grab the first two most important components from PCA. Play with some other sets.
principal_components = data.frame(principal_components)
p = ggplot(principal_components, aes(x=PC1, y=PC2)) +
geom_point(size=3)
print(p)
Note how many clusters of points you see. Add color to this plot to indicate which samples are from the environment (i.e. Metasub’s subway samples), the human gut, and the “mystery” samples.
Execute the PCA Plot with color:
library(ggplot2)
library(reshape2)
taxa = read.csv('taxa_table.csv', header=TRUE, sep=',')
taxa = taxa[taxa$rank == 'species',]
taxa = acast(taxa, sample~taxon, value.var='percent_abundance', fill=0) # reshape into a sample-by-taxon matrix, filling missing combinations with 0. Add some print statements and see if you can understand what this line does
pca = prcomp(taxa) # run PCA on our samples. look at the pca object and get a sense of its features
principal_components = pca$x[,1:2] # grab the first two most important components from PCA. Play with some other sets.
principal_components = data.frame(principal_components)
print(principal_components)
principal_components$sample_type = unlist(
lapply(rownames(principal_components), FUN=function(x){strsplit(x, '\\.')[[1]][1]})
)
principal_components$sample_type = unlist(lapply(principal_components$sample_type, FUN=function(x){
if(x == 'gastrointestinal' || x == 'mystery'){
return(x)
}
return('environmental')
}))
p = ggplot(principal_components, aes(x=PC1, y=PC2, color=sample_type)) +
geom_point(size=3) +
labs(color='Samples') # the points use the color aesthetic (not fill), so we title the color legend
print(p)
Here is a video of this step being performed: https://youtu.be/3H2wGMWdflY
Now create a barplot with all of the MetaSUB samples:
library(ggplot2)
taxa_all_barplot = read.csv('kraken_all_metasub.csv', header=TRUE, sep=',')
taxa_all_barplot = taxa_all_barplot[taxa_all_barplot$rank == 'phylum',]
ggplot(taxa_all_barplot, aes(x=sample, y=percent_abundance, fill=taxon)) +
geom_bar(stat="identity") +
xlab('Sample') +
ylab('Abundance') +
labs(fill='Phylum') +
theme(axis.text.x = element_text(angle = 90))
CP 4:
Please post your results through this step. Briefly compare the MetaSUB, HMP Gastrointestinal, and mystery sample compositions to each other. Submit your barplots and PCA plots and tell us your best hypothesis as to the source of the mystery samples.
Read in all of the Kraken data that you and your classmates generated:
taxa_all = read.csv('kraken_all_metasub.csv', header=TRUE, sep=',')
Click on the object in the Environment pane and sort it by sample name.
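If you prefer to do this with code instead of clicking, a minimal sketch (assuming the sample column is named sample, as in the earlier plots) is:
taxa_all_sorted = taxa_all[order(taxa_all$sample), ] # order the rows alphabetically by sample name
View(taxa_all_sorted) # open RStudio's spreadsheet-style viewer on the sorted table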
Locate 3-5 high-abundance species in your sample and use the MetaSUB dashboard to detect whether there are strains of these species that have evolved resistance to antibiotics:
Navigate to https://www.patricbrc.org/. Use the tools in PATRIC to identify at least one AMR locus/gene for each of the species you search.
Here’s a screencast showing one example of how to do this with Staphylococcus aureus: https://youtu.be/dMK2abwzoZE.
CP 5:
Post your results from the PATRIC database to Slack.