⚠️ Homework due before this week’s lab: compose two questions for the Illumina Bioinformatician who is giving a Career Journey + Q&A Talk to your lab section. Post your questions in the Checkpoint (CP) Results tab of the Google Sheet.
⚠️ If you haven’t completed the Pre-VTP Survey, please do so before you start the project.
Here is a Google Sheet that contains each participant’s:
General Info tab
Checkpoint (CP) Results tab
Please join the Slack channel for this course. You should have received an email to do this. If you didn’t, please send your email address to [email protected].
The primary goal of this project is to use the Linux command line and the R programming language to bioinformatically characterize, quantify, and visualize the species composition of urban microbiome samples (i.e. subway swabs), from raw data to completed analysis.
What is bioinformatics?
Think, Room (4-5 ppl), Share:
1 minute: think about one lab, activity, or lesson you previously taught that had an element that could have been explored using bioinformatics analysis. (Explain.)
3 minutes: divide into breakout rooms and discuss with your group.
2 minutes: re-join the main meeting room and present 1-2 examples to the other participants.
Why is the metagenomics application of bioinformatics useful?
In case you (or more likely, your students) are thinking, “How would someone be able to use this knowledge in real life?”, here are three real-world applications of the metagenomics analysis technique you will learn:
Research Applications: you can ask questions about the similarity of microbiome samples, the prevalence/emergence of antimicrobial-resistant strains, etc. The Metasub Project, which is where the data you’ll analyze in this VTP comes from, is an example of this application.
Clinical Applications: this technique is already being employed by some biotech startups, like Karius (https://kariusdx.com/), to rapidly diagnose blood-borne infections.
Biotech/Industry Applications: some companies offer microbial surveillance services to identify and monitor the presence of resistant pathogens (e.g. on hospital surfaces). One such startup, Biotia (https://www.biotia.io/), was spun out of the Mason lab (the same Cornell Med lab that created the Metasub project) and utilizes many of the same techniques that you will employ in this VTP.
In a broader sense, bioinformatics is a subdivision of data science; principles learned in the former are highly relevant to the latter. We’ll point these out as we go along.
History of metagenomics:
Example: Clinical Application of Metagenomics:
Think, Room (4-5 ppl), Share:
1 minute: come up with one potential application of metagenomics not discussed here.
3 minutes: divide into breakout rooms and discuss with your group.
2 minutes: re-join the main meeting room and present 1-2 examples from your breakout room to the other participants.
To accomplish the goals of this project, we will use data from the Metasub project, an effort to characterize the “built environment” microbiomes of mass transit systems around the world, headed by Dr. Chris Mason’s lab at Weill Cornell Medical Center (http://www.masonlab.net/).
Here’s the recent Metasub paper in case you haven’t seen it and would like to review it later: https://milrd.org/wp-content/uploads/2022/09/Cell_A-global-metagenomic-map-of-urban-microbiomesand-antimicrobial-resistance.pdf
(To be frank, this article is a bit challenging to read, so we suggest you review it later on if you’re inclined, after you’ve done a bit of the project.)
Some additional information in case you’re interested:
New York Times article about the Metasub Project: https://www.nytimes.com/2021/05/26/science/microbes-subway-metasub-mason.html?smid=url-share
Metasub Project website: http://metasub.org
Metasub was born out of a project called PathoMap (also from the Mason lab), which began in summer 2013 to profile the New York City metagenome in, around, and below NYC, focusing on mass-transit areas of the built environment, particularly the subway.
Here’s the Pathomap paper in case you would like to review it later: https://milrd.org/wp-content/uploads/2022/09/Geospatial-Resolution-of-Human-and-Bacterial-Diversity-with-City-Scale-Metagenomics.pdf
Pathomap sought to establish baseline profiles across the subway system, identify potential bio-threats, and provide an additional level of data that the city can use to become a “smart city”; i.e., one that uses high-dimensional data to improve city planning, management, and human health.
Metasub extended the Pathomap project based on the recognition that NYC is not the only city in the world that could benefit from a systematic, longitudinal metagenomic profile of its subway system.
Although the NYC subway has the most stations, it ranks 7th in the world in terms of the number of riders per year. A wide variety of population densities, lengths, and climate types define the busiest subways of the world, ranging from cold (Moscow) to temperate (New York City, Paris), to subtropical (Mexico City) and tropical (São Paulo).
To address this gap in our knowledge of the built environment, the Mason lab created Metasub: an international consortium of laboratories to establish a world-wide “DNA map” of microbiomes in mass transit systems.
Take a look at Figure 1, which provides an overview of the Pathomap project’s design and execution:
As you can see, the researchers:
collected samples across New York City’s five boroughs, covering the 466 subway stations of NYC across the 24 subway lines;
extracted, sequenced, QC’d, and analyzed DNA;
mapped the distribution of taxa identified from the entire pooled dataset; and
presented a geospatial analysis of a highly prominent genus, Pseudomonas.
Notably, as seen in (D), nearly half of the DNA sequenced (48%) did not match any known organism, underscoring the vast wealth of likely unknown organisms that surround passengers every day.
Think, Room (4-5 ppl), Share:
1 minute: why do you think so much of the sequenced DNA did not match any known organism?
3 minutes: divide into breakout rooms and discuss with your group.
2 minutes: re-join the main meeting room and present 1-2 examples from your breakout room to the other participants.
Now, let’s focus on Fig 1C, as samples from the Metasub project were collected, sequenced, and analyzed in a similar manner to the Pathomap project. Here’s a simplified version of what the Mason lab did in Fig 1C:
This guide is intended to teach you one component of metagenomic analysis: how to plot abundances at the “phylum” level for each metagenomics sample.
Throughout the VTP, each participant characterizes, quantifies, and visualizes microbial metagenomics data from sequenced swabs of public urban environments on their own AWS High Performance Compute instance. In the Linux terminal, they perform genomic data quality control, genome alignment, and taxonomic characterization & abundance quantification; in R, they visualize results and conduct a principal component analysis. To conclude, they investigate their most abundant species and use the PATRIC database to consider how they would determine the strains of these species.
Think, Room (4-5 ppl), Share:
1 minute: note that we’re using two programming languages (Linux/Bash and R). Why do you think it’s necessary to use two programming languages, and not just one?
3 minutes: divide into breakout rooms and discuss with your group.
2 minutes: re-join the main meeting room and present 1-2 examples from your breakout room to the other participants.
All bioinformatics tasks will be performed in “the cloud” on an Amazon Web Services (AWS) hosted high performance compute instance.
What is cloud computing?
Think, Room (4-5 ppl), Share:
1 minute: why do we (as well as professional researchers, data scientists, and bioinformaticians) often need to perform analyses on the cloud?
3 minutes: divide into breakout rooms and discuss with your group.
2 minutes: re-join the main meeting room and present 1-2 examples from your breakout room to the other participants.
What are R and RStudio?
RStudio is an integrated development environment (IDE) for the open-source R programming language, which basically means it brings the core components of R (Scripting Pane, Console, Environment Pane, File Manager/Plots/Help Pane) into a quadrant-based user interface for efficient use.
Here is what the R dashboard looks like:
In R, code is always executed via the Console, but you have the choice whether to execute that code from a Script (opened in the Scripting Pane) or directly in the Console. Unless you need to quickly execute a one-liner (e.g. setting a working directory using the setwd command), you’ll want to write your code in a script, highlight it, and click Run to execute it in the Console. This is because you can easily edit code in a script and re-run it; once code is run in the Console, it is not editable.
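For example, here’s the kind of one-liner you might run directly in the Console (the path here is just illustrative; use a directory that exists on your instance):
setwd('~/reads')   # set the working directory for this session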
You are provided an instance of the R editor RStudio on AWS, already waiting for you to use. Log in to your AWS-hosted RStudio instance using the URL, username, and password assigned to you in the Google Sheet.
Please make sure the RStudio user interface dashboard looks as follows: Console Pane (Lower Left Quadrant), Scripting Pane (Upper Left Quadrant), File Manager/Plots Display/Help-Tab Pane (Lower Right Quadrant), Environment/History-Tab Pane (Upper Right Quadrant).
In this section, you will learn how to use R to generate barplots of metagenomic data.
We have provided processed metagenomic data broken down by sample (e.g. mini_SL342389_taxa.csv and yourSample_taxa.csv) for your convenience. Each file contains taxonomically characterized and quantified metagenomics data.
First we will plot our samples as a stacked barplot at the phylum level. A stacked barplot shows a set of numbers as a series of column segments stacked on top of each other, colored by label. In our case, a taxonomic profile is a set of numerical abundances, each labeled by the microbial taxon it belongs to.
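To give you a sense of the shape of the data the template below expects, here are a few hypothetical rows (the values are made up for illustration; the column names match those used in the code):
# sample          taxon            rank     percent_abundance
# mini_SL342389   Proteobacteria   phylum   45.2
# mini_SL342389   Actinobacteria   phylum   21.7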
Later, we will make the same plot at the family level. Here’s the template code to make the phylum-level plot in R:
library(ggplot2)
taxa_yourSample <- read.csv('yourSample_taxa.csv', header=TRUE, sep=',')
phyla_yourSample <- taxa_yourSample[taxa_yourSample$rank == 'phylum',]
phyla_yourSample <- phyla_yourSample[phyla_yourSample$percent_abundance >= 0.3,]
ggplot(phyla_yourSample, aes(x=sample, y=percent_abundance, fill=taxon)) +
geom_bar(stat="identity") +
xlab('Sample') +
ylab('Abundance') +
labs(fill='Phylum') +
theme(axis.text.x = element_text(angle = 90))
Walking through the template:
library(ggplot2) loads the ggplot2 library into our environment. The library() function is used to load additional libraries that contain functions to be utilized by the script that follows.
The data is read in with the read.csv() function.
The filtered data is passed to the ggplot() object.
aes() tells ggplot() how we want our data to be displayed.
xlab(), ylab(), and labs() tell ggplot() what our axis labels should be.
mini_SL342389
This code generates a phylum-level barplot; it was produced by modifying taxa_yourSample, yourSample_taxa.csv, and phyla_yourSample in the template per the Example sample.
library(ggplot2)
taxa_mini_SL342389 <- read.csv('mini_SL342389_taxa.csv', header=TRUE, sep=',')
phyla_mini_SL342389 <- taxa_mini_SL342389[taxa_mini_SL342389$rank == 'phylum',]
phyla_mini_SL342389 <- phyla_mini_SL342389[phyla_mini_SL342389$percent_abundance >= 0.3,]
ggplot(phyla_mini_SL342389, aes(x=sample, y=percent_abundance, fill=taxon)) +
geom_bar(stat="identity") +
xlab('Sample') +
ylab('Abundance') +
labs(fill='Phylum') +
theme(axis.text.x = element_text(angle = 90))
This video shows how to perform this step with a generic dataset, but the process of code execution is the same:
Clear your Console, Environment Tab, and Plots Tab using the 🧹 button and re-run the script chunk-by-chunk (noting what each chunk appears to be doing):
library(ggplot2)

taxa_mini_SL342389 <- read.csv('mini_SL342389_taxa.csv', header=TRUE, sep=',')

After this chunk runs, the taxa_mini_SL342389 object should be listed in the Environment Tab. Click on the name of this object in the Environment Tab and it should load as a table in a new tab in the Scripting Pane (as long as it’s not too big). Alternatively, you can use the head() function to return the first “parts of a vector, matrix, table, data frame or function”; to learn more about this function, execute ?head() in the Console. Go ahead and execute head(taxa_mini_SL342389) in the Console now. It will show the first 6 lines of taxa_mini_SL342389 by default.

Note the rank column. Even from the first few lines of the taxa_mini_SL342389 object, it’s evident there are taxonomic classifications made at several levels (e.g. phylum, class, order, genus, etc.). Keep this in mind as you execute the next chunk.

phyla_mini_SL342389 <- taxa_mini_SL342389[taxa_mini_SL342389$rank == 'phylum',]

Execute head(phyla_mini_SL342389) in the Console. How did this line of code transform the data?

phyla_mini_SL342389 <- phyla_mini_SL342389[phyla_mini_SL342389$percent_abundance >= 0.3,]

ggplot(phyla_mini_SL342389, aes(x=sample, y=percent_abundance, fill=taxon)) +
geom_bar(stat="identity") +
xlab('Sample') +
ylab('Abundance') +
labs(fill='Phylum') +
theme(axis.text.x = element_text(angle = 90))
This code generates a family-level barplot; it was produced by modifying taxa_yourSample, yourSample_taxa.csv, phyla_yourSample, and labs() in the template per the Example sample.
library(ggplot2)
taxa_mini_SL342389 <- read.csv('mini_SL342389_taxa.csv', header=TRUE, sep=',')
family_mini_SL342389 <- taxa_mini_SL342389[taxa_mini_SL342389$rank == 'family',]
family_mini_SL342389 <- family_mini_SL342389[family_mini_SL342389$percent_abundance >= 0.3,]
ggplot(family_mini_SL342389, aes(x=sample, y=percent_abundance, fill=taxon)) +
geom_bar(stat="identity") +
xlab('Sample') +
ylab('Abundance') +
labs(fill='Family') +
theme(axis.text.x = element_text(angle = 90))
Generate a phylum-level barplot by modifying taxa_yourSample, yourSample_taxa.csv, and phyla_yourSample in the template script per your assigned sample.
library(ggplot2)
taxa_yourSample <- read.csv('yourSample_taxa.csv', header=TRUE, sep=',')
phyla_yourSample <- taxa_yourSample[taxa_yourSample$rank == 'phylum',]
phyla_yourSample <- phyla_yourSample[phyla_yourSample$percent_abundance >= 0.3,]
ggplot(phyla_yourSample, aes(x=sample, y=percent_abundance, fill=taxon)) +
geom_bar(stat="identity") +
xlab('Sample') +
ylab('Abundance') +
labs(fill='Phylum') +
theme(axis.text.x = element_text(angle = 90))
Clear your Console, Environment Tab, and Plots tab using the 🧹 button and re-run the script chunk-by-chunk (noting what each chunk appears to be doing):
library(ggplot2)

taxa_yourSample <- read.csv('yourSample_taxa.csv', header=TRUE, sep=',')

phyla_yourSample <- taxa_yourSample[taxa_yourSample$rank == 'phylum',]

phyla_yourSample <- phyla_yourSample[phyla_yourSample$percent_abundance >= 0.3,]

ggplot(phyla_yourSample, aes(x=sample, y=percent_abundance, fill=taxon)) +
geom_bar(stat="identity") +
xlab('Sample') +
ylab('Abundance') +
labs(fill='Phylum') +
theme(axis.text.x = element_text(angle = 90))
Modify code: edit your script to generate a barplot at the “family” level, by modifying this line of code:
phyla_yourSample <- taxa_yourSample[taxa_yourSample$rank == 'phylum',]
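For instance, the modified line might look like the sketch below (rename the object however you like, and remember to update the legend label in the plotting call too):
family_yourSample <- taxa_yourSample[taxa_yourSample$rank == 'family',]   # filter at the family level
# ...and in the ggplot() call, change labs(fill='Phylum') to labs(fill='Family')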
Please upload the following to the Google Sheet:
Barplot at the phylum level. (To download this plot as a file, click the (+) Zoom button, right click on the plot and select “Save Image As”.)
Barplot at the family level. (To download this plot as a file, click the (+) Zoom button, right click on the plot and select “Save Image As”.)
Answer these questions:
What line of code would you use to read the taxa_table.csv file into the taxa object (copy/paste the exact line of code)?
What is the $ operator doing in this line of code: taxa = taxa[taxa$rank == 'phylum',] ?
Access your Linux terminal by logging into RStudio via the web browser (your login details can be found in the Google Sheet).
Set up your Linux environment:
It should look like this once you’re finished setting up your Linux environment:
Please take a screenshot of the last 15 lines of your history output and upload it to the Google Sheet.
To make sure you understand the workflow, we’d like to first have you complete the analysis with a small dataset. We’ve taken 50,000 reads from the read1 and read2 files of each assigned sample. They are labeled with the prefix mini_ and located in the reads directory (files without the mini_ prefix are the original, full-size files, which each contain millions of reads).
Data from each Metasub sample is contained in two files: read1 (i.e. R1) and read2 (i.e. R2). This is because the researchers paired-end sequenced each sample. When samples are sequenced on an Illumina genomic sequencer, researchers get to decide: (1) the number of reads to generate for each sample library; (2) the read length (within a range of ~50 to ~300 nucleotides); and (3) whether to generate a single read or a pair of reads from each sequenced DNA library fragment.
Here is a diagram that summarizes paired-end versus single-end sequencing:
The raw genomic data for this project is in fastq files. Fastq files have the extension .fastq.
Fastq files are collections of fastq records.
A FASTQ record has the following four-line format: (1) a header line, beginning with @, containing the read ID; (2) the nucleotide sequence; (3) a separator line beginning with +; and (4) a quality-score string with one character per nucleotide.
Here’s an example of a FASTQ file with two records:
@071112_SLXA-EAS1_s_7:5:1:817:345
GGGTGATGGCCGCTGCCGATGGCGTCAAATCCCACC
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IG9IC
@071112_SLXA-EAS1_s_7:5:1:801:338
GTTCAGGGATACGACGTTTGTATTTTAAGAATCTGA
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII6IBI
Navigate to the reads directory (which contains the “mini” fastq data):
cd ~/reads/
PLEASE NOTE: make sure to complete all Linux steps from the reads directory.
mini_SL342389
Take a look at the first 10 lines of the Example Sample Read1 fastq file:
zcat mini_SL342389.R1.fastq.gz | head
Take a look at the first 10 lines of the Example Sample Read2 fastq file:
zcat mini_SL342389.R2.fastq.gz | head
Take a look at the first 10 lines of Your Sample’s Read1 fastq file:
zcat mini_my_reads.R1.fastq.gz | head
Take a look at the first 10 lines of Your Sample’s Read2 fastq file:
zcat mini_my_reads.R2.fastq.gz | head
(Please note: mini_my_reads is a placeholder for your assigned miniature reads dataset.)
In the Google Sheet, please:
Upload a screenshot of your zcat results.
Answer these questions:
Here is the AdapterRemoval command template:
AdapterRemoval --file1 mini_my_reads.R1.fastq.gz --file2 mini_my_reads.R2.fastq.gz --trimns --trimqualities --minquality 40 --threads 2 --gzip --output1 clean.mini_my_reads.R1.fastq.gz --output2 clean.mini_my_reads.R2.fastq.gz
AdapterRemoval calls the program you want to invoke.
--file1 tells the program where to find the Read1 input data.
--file2 tells the program where to find the Read2 input data.
--trimns tells the program to trim ambiguous bases (‘N’s) from reads.
--trimqualities --minquality tell the program to trim nucleotide bases below a certain quality.
--threads tells the computer how many threads to use when running the program.
--gzip tells the program that both the input and output files are gzip compressed.
--output1 tells the program where to place the processed Read1 data.
--output2 tells the program where to place the processed Read2 data.
To confirm AdapterRemoval is installed and running, execute:
AdapterRemoval -h
You should see a long list of all the arguments that can be given to this tool.
mini_SL342389
AdapterRemoval --file1 mini_SL342389.R1.fastq.gz --file2 mini_SL342389.R2.fastq.gz --trimns --trimqualities --minquality 40 --threads 2 --gzip --output1 clean.mini_SL342389.R1.fastq.gz --output2 clean.mini_SL342389.R2.fastq.gz
Suggestion: execute ls to confirm clean.mini_SL342389.R1.fastq.gz and clean.mini_SL342389.R2.fastq.gz were generated.
Run AdapterRemoval on your assigned sample:
AdapterRemoval --file1 mini_my_reads.R1.fastq.gz --file2 mini_my_reads.R2.fastq.gz --trimns --trimqualities --minquality 40 --threads 2 --gzip --output1 clean.mini_my_reads.R1.fastq.gz --output2 clean.mini_my_reads.R2.fastq.gz
Suggestion: execute ls to confirm clean.mini_my_reads.R1.fastq.gz and clean.mini_my_reads.R2.fastq.gz were generated.
mini_SL342389
Look at the first 10 lines of your AdapterRemoval Read1 and Read2 output files (i.e. “cleaned” or “clean” reads):
zcat clean.mini_SL342389.R1.fastq.gz | head
zcat clean.mini_SL342389.R2.fastq.gz | head
For comparison, look again at the first 10 lines of your initial (i.e. “raw”) reads:
zcat mini_SL342389.R1.fastq.gz | head
zcat mini_SL342389.R2.fastq.gz | head
Suggestion: Make note of any differences you see between the lengths of the Clean and Raw reads.
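If you’d like to quantify the read lengths rather than eyeball them, here’s an optional one-liner sketch (in a fastq file, every 4th line starting at line 2 is a sequence; awk tallies how many reads have each length):
zcat clean.mini_SL342389.R1.fastq.gz | awk 'NR % 4 == 2 {print length($0)}' | sort -n | uniq -c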
Count the number of reads in each of your cleaned fastq files by running this code and dividing the output by four:
zcat clean.mini_SL342389.R1.fastq.gz | wc -l
zcat clean.mini_SL342389.R2.fastq.gz | wc -l
Count the number of reads in each of your raw (i.e. initial) fastq files by running this code and dividing the output by four:
zcat mini_SL342389.R1.fastq.gz | wc -l
zcat mini_SL342389.R2.fastq.gz | wc -l
Suggestion: Make note of any differences you see between the read counts of the Clean and Raw reads.
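Why divide by four? Each fastq record spans exactly four lines. If you’d rather let the shell do the arithmetic, a small sketch:
echo $(( $(zcat clean.mini_SL342389.R1.fastq.gz | wc -l) / 4 ))   # number of reads in the file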
Look at your clean… results:
zcat clean.mini_my_reads.R1.fastq.gz | head
zcat clean.mini_my_reads.R2.fastq.gz | head
For comparison, look again at the first 10 lines of your raw reads:
zcat mini_my_reads.R1.fastq.gz | head
zcat mini_my_reads.R2.fastq.gz | head
Count the number of reads in each of your cleaned fastq files by running:
zcat clean.mini_my_reads.R1.fastq.gz | wc -l
zcat clean.mini_my_reads.R2.fastq.gz | wc -l
Count the number of reads in each of your raw (i.e. initial) fastq files by running:
zcat mini_my_reads.R1.fastq.gz | wc -l
zcat mini_my_reads.R2.fastq.gz | wc -l
Record your Raw and Clean read counts to report them in CP4.
Notice any differences between the Clean and Raw reads? If so, write them down.
In the Google Sheet, please: 1. Upload a screenshot of your zcat results. 2. Answer these questions:
Researchers often decide at this point to remove DNA from unwanted species, such as human; DNA that we don’t want is called “contaminating DNA.” We are not going to perform this step due to time constraints, but a sketch of how it is typically done follows below.
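Purely for illustration (this is not part of the VTP, and the index name below is hypothetical), a common approach is to align the cleaned reads to a human reference with bowtie2 and keep only the read pairs that fail to align:
# align to a human reference index; write pairs that do NOT align concordantly
# (i.e. the presumably non-human reads) to decontam.R1/R2; discard the SAM output
bowtie2 -x human_index -1 clean.mini_my_reads.R1.fastq.gz -2 clean.mini_my_reads.R2.fastq.gz --un-conc-gz decontam.R%.fastq.gz -S /dev/null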
Here is the kraken command template:
kraken --gzip-compressed --fastq-input --threads 2 --paired --preload --db ~/databases/minikraken_20171101_8GB_dustmasked clean.mini_my_reads.R1.fastq.gz clean.mini_my_reads.R2.fastq.gz > mini_my_reads_taxonomic_report.csv
kraken is the command invocation.
--gzip-compressed indicates our reads are gzip compressed.
--fastq-input indicates our reads are formatted as fastq files.
--threads tells the compute instance how many threads to use.
--paired indicates we’re using paired-end data.
--preload specifies to load the database into RAM before running.
--db specifies the path to the database.
clean.mini_my_reads.R1.fastq.gz is the filepath to the Read1 file.
clean.mini_my_reads.R2.fastq.gz is the filepath to the Read2 file.
mini_my_reads_taxonomic_report.csv is the filepath to the file where kraken will deposit results.
Before you can run kraken, install it with conda:
conda install --channel bioconda --yes kraken
conda is a package manager.
install is a subcommand of conda.
--channel bioconda tells conda to fetch the software from the Bioconda channel.
--yes gives conda permission to install software.
kraken is the program we are installing.
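As an optional sanity check (an aside, not a required step), you can also confirm the package landed in your environment:
conda list | grep -i kraken   # lists the installed kraken package and its version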
You should now be able to test Kraken by running:
kraken --help
--help is an option that displays the options for the tool.
If a list of kraken options and explanations is displayed, it’s confirmation the tool is running. Provided this happens, we can now perform taxonomic classification on our data using this command.
mini_SL342389
kraken --gzip-compressed --fastq-input --threads 2 --paired --preload --db ~/databases/minikraken_20171101_8GB_dustmasked clean.mini_SL342389.R1.fastq.gz clean.mini_SL342389.R2.fastq.gz > mini_SL342389_taxonomic_report.csv
Confirm the mini_SL342389_taxonomic_report.csv file is present using the ls command. Then take a look at the output:
head mini_SL342389_taxonomic_report.csv
(The head command prints the first 10 lines of any file to the terminal.)
Note: Column 1 indicates whether the read was classified by Kraken or not. Column 2 denotes the Read ID (line 1 of the fastq read record). Subsequent columns indicate taxonomic mappings, if any.
kraken --gzip-compressed --fastq-input --threads 2 --paired --preload --db ~/databases/minikraken_20171101_8GB_dustmasked clean.mini_my_reads.R1.fastq.gz clean.mini_my_reads.R2.fastq.gz > mini_my_reads_taxonomic_report.csv
Confirm the file is present using the ls command.
Once this command runs, take a look at the output by running:
head mini_my_reads_taxonomic_report.csv
The mini_SL342389_taxonomic_report.csv file is formatted for storage efficiency and is not very useful for biological interpretation. Convert this file to the .mpa format so that it can be interpreted more easily.
mini_SL342389
kraken-mpa-report --db ~/databases/minikraken_20171101_8GB_dustmasked mini_SL342389_taxonomic_report.csv > mini_SL342389_final_taxonomic_results.mpa
Look at the first 20 lines by executing:
head -20 mini_SL342389_final_taxonomic_results.mpa
Look at the last 20 lines by executing:
tail -20 mini_SL342389_final_taxonomic_results.mpa
Pull out the species-level assignments in your .mpa file and store them in a .tsv file (a sketch of one way to do this is below):
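The exact command isn’t shown here; a minimal sketch, assuming the usual .mpa rank prefixes (species-level entries contain s__) and the placeholder filenames used throughout:
grep 's__' mini_my_reads_final_taxonomic_results.mpa > mini_my_reads_species_only.tsv   # keep only species-level lines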
mini_SL342389
Look at the first 10 lines:
head mini_SL342389_species_only.tsv
Visually confirm that all taxonomic assignments are at the species level.
See how many species were identified:
wc -l mini_SL342389_species_only.tsv
Now repeat for your assigned sample. Look at the first 10 lines:
head mini_my_reads_species_only.tsv
Visually confirm that all taxonomic assignments are at the species level.
See how many species were identified:
wc -l mini_my_reads_species_only.tsv
In the Google Sheet, please:
Upload a screenshot of your kraken command output (i.e. the one that states the % sequences classified and unclassified).
Upload a screenshot of your head -20 mini_my_reads_final_taxonomic_results.mpa command.
Upload a screenshot of your tail -20 mini_my_reads_final_taxonomic_results.mpa command.
Answer these questions:
mini_SL342389
taxa_mini_SL342389_species = read.csv('~/reads/mini_SL342389_species_only.tsv', header=FALSE, sep='\t')
# order by column V2 (the second column, which holds the counts), largest first
taxa_mini_SL342389_species_ordered <- taxa_mini_SL342389_species[order(taxa_mini_SL342389_species$V2, decreasing = TRUE), ]
# show the 5 most abundant species
head(taxa_mini_SL342389_species_ordered, 5)
taxa_yourSample_species <- read.csv('~/reads/mini_my_reads_species_only.tsv', header=FALSE, sep='\t')
taxa_yourSample_species_ordered <- taxa_yourSample_species[order(taxa_yourSample_species$V2, decreasing = TRUE), ]
head(taxa_yourSample_species_ordered, 5)
In the Google Sheet:
library(ggplot2)
taxa_all_barplot <- read.csv('kraken_all_metasub.csv', header=TRUE, sep=',')
taxa_all_barplot <- taxa_all_barplot[taxa_all_barplot$rank == 'phylum',]
taxa_all_barplot <- taxa_all_barplot[taxa_all_barplot$percent_abundance >= 0.3,]
ggplot(taxa_all_barplot, aes(x=sample, y=percent_abundance, fill=taxon)) +
geom_bar(stat="identity") +
xlab('Sample') +
ylab('Abundance') +
labs(fill='Phylum') +
theme(axis.text.x = element_text(angle = 90))
Please upload the following to the Google Sheet:
Lookup: Abund. Spp. Resistance Loci
Navigate to the Bacterial and Viral Bioinformatics Resource Center (BV-BRC): https://www.bv-brc.org/
Use the tools in BV-BRC to ID at least one AMR locus/gene for the top 5 most abundant species in your sample.
1. Search for one of your species and select TAXON OVERVIEW from the green column.
2. Click the Specialty Genes tab.
3. Click the Filter button and filter by Antibiotic Resistance.
4. Search for efflux pump conferring antibiotic resistance using the magnifying glass feature.
5. Choose the FASTA option in the green column and select View FASTA DNA. This will show you the nucleotide sequences of the selected resistance loci.
Here’s a screencast using the species Staphylococcus aureus (Methicillin-resistant Staphylococcus aureus (MRSA) is a widely known antibiotic-resistant hospital-acquired infection):
In the Google Sheet: record the efflux pump conferring antibiotic resistance locus IDs for the species you queried using the BV-BRC.

In this section, you will learn how to use R to generate plots that help visualize the similarities and differences between a subset of the Metasub samples and compare them to metagenomics samples from the Human Microbiome Project. You will first see how to do this at the phylum and family levels using barplots (which you have done before) and then at the species level using a technique called Principal Component Analysis.
As humans, it’s easiest for us to see plots in two dimensions, but microbiome data is not inherently two dimensional. In a formal mathematical sense, microbiome data has a dimensionality equal to the number of microbes which can be identified – often thousands. However, most of these dimensions are highly correlated meaning, roughly, that most of the information about the abundance of some particular microbe can probably be well represented by using the abundance of a different microbe and a constant scaling factor.
In fact, it is possible to find sets of scaling factors that (likely) represent as much information as possible about a set of samples using a technique called Principal Component Analysis (or PCA for short). PCA is mathematically complex and we won’t explain it here but we will do an exercise to show how it can be used in R to plot your data in 2D.
The main idea of PCA is to “reduce the dimensionality of a data set consisting of many variables correlated with each other, either heavily or lightly, while retaining the variation present in the dataset, up to the maximum extent” 1.
What constitutes a dimension in our specific dataset? As mentioned in the instructions file: >In a formal mathematical sense, microbiome data has a dimensionality equal to the number of microbial species which can be identified – often thousands. However, most of these dimensions are highly correlated meaning, roughly, that most of the information about the abundance of some particular microbe can probably be well represented by using the abundance of a different microbe and a constant scaling factor.
As you can see, when the PCA plot is colored by sample type, you can see how well each sample type clusters and how similar it is to the other sample types.
PC = Principal Component. PC1 is the first principal component, PC2 is the second principal component. Each principal component captures a percentage of the variance in the data. Here, we plot two principal components, but many more can be computed. There are often diminishing returns, however, and it’s not possible to (easily) visualize more than 2-3 (e.g. on X, Y, Z axes) without doing more complicated analysis. A good rule of thumb: if a lot of variation is captured by the first two PCs, just plot those and be done:)
Sometimes you see each PC’s contribution to variance in parentheses on each axis; an example of this would be in the Human Microbiome Project paper. An interesting exercise is to figure out what other PCs are computed using the script we provided and how much variance each accounts for; one way to do this is sketched below.
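A minimal sketch (this assumes you have already created the pca object with prcomp(), as in the PCA script later in this section):
summary(pca)                  # the 'Proportion of Variance' row shows each PC's share
pca$sdev^2 / sum(pca$sdev^2)  # the same proportions, computed by hand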
I would interpret your PCA results as follows: the mystery samples don’t cluster nicely via PC1/PC2 like the Metasub (i.e. “environmental”) and GI HMP samples do. One sample seems to be similar on PC2 but not PC1, and vice versa. A third mystery sample seems to split the difference.
taxa_table.csv
library(ggplot2)
taxa = read.csv('taxa_table.csv', header=TRUE, sep=',')
taxa = taxa[taxa$rank == 'phylum',]
ggplot(taxa, aes(x=sample, y=percent_abundance, fill=taxon)) +
geom_bar(stat="identity") +
xlab('Sample') +
ylab('Abundance') +
labs(fill='Phylum') +
theme(axis.text.x = element_text(angle = 90))
taxa_table.csv
library(ggplot2)
taxa = read.csv('taxa_table.csv', header=TRUE, sep=',')
taxa = taxa[taxa$rank == 'family',]
ggplot(taxa, aes(x=sample, y=percent_abundance, fill=taxon)) +
geom_bar(stat="identity") +
xlab('Sample') +
ylab('Abundance') +
labs(fill='Family') +
theme(axis.text.x = element_text(angle = 90))
taxa_table.csv
library(ggplot2)
library(reshape2)
taxa = read.csv('taxa_table.csv', header=TRUE, sep=',')
taxa = taxa[taxa$rank == 'species',]
taxa = acast(taxa, sample~taxon) # add some print statements and see if you can understand what this line does
pca = prcomp(taxa) # run PCA on our samples. look at the pca object and get a sense of its features
principal_components = pca$x[,1:2] # grab the first two most important components from PCA. Play with some other sets.
principal_components = data.frame(principal_components)
print(principal_components)
principal_components$sample_type = unlist(
  lapply(rownames(principal_components), FUN=function(x){strsplit(x, '\\.')[[1]][1]})
)
principal_components$sample_type = unlist(lapply(principal_components$sample_type, FUN=function(x){
  if(x == 'gastrointestinal' || x == 'mystery'){
    return(x)
  }
  return('environmental')
}))
p = ggplot(principal_components, aes(x=PC1, y=PC2, color=sample_type)) +
geom_point(size=3) +
labs(fill='Samples')
print(p)
Here is a video of this step being performed: https://youtu.be/3H2wGMWdflY
In the Google Sheet:
Upload the Phylum-level Barplot for the 9 samples.
Upload the Family-level Barplot for the 9 samples.
Upload the PCA plot for the nine samples.
Briefly compare the Metasub, HMP Gastrointestinal, and mystery sample composition to each other. Make an educated guess as to the source of the mystery sample.
Hypothesize how you would figure out if a resistant strain of one of the abundant species in your sample was present.
While the Linux steps can be challenging to perform on a local computer, students should be able to perform all R/RStudio-related steps.
R is free and open source. The Desktop version of RStudio is free.
To use RStudio, you must:
Download and install Base R.
Download and install RStudio.
Download kraken_all_metasub.csv to your local computer and create a barplot
Download the kraken_all_metasub.csv file to your local computer from your AWS-hosted RStudio instance.
library(ggplot2)
taxa_all_barplot <- read.csv('~/Downloads/kraken_all_metasub.csv', header=TRUE, sep=',')
taxa_all_barplot <- taxa_all_barplot[taxa_all_barplot$rank == 'phylum',]
taxa_all_barplot <- taxa_all_barplot[taxa_all_barplot$percent_abundance >= 0.3,]
ggplot(taxa_all_barplot, aes(x=sample, y=percent_abundance, fill=taxon)) +
geom_bar(stat="identity") +
xlab('Sample') +
ylab('Abundance') +
labs(fill='Phylum') +
theme(axis.text.x = element_text(angle = 90))
Along the way, you practiced how to:
Use functions like head() and tail() to understand the structure/format of the data.
Use accessor operators ($ or @) to filter or sort data by specific variables (i.e. columns).
Use summary(), density() and plot() to understand the descriptive statistics and general distribution of a dataset; a short example of this appears below.
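For instance, a quick sketch (assuming the taxa_all_barplot data frame from the script above is loaded):
summary(taxa_all_barplot$percent_abundance)        # descriptive statistics of the abundances
plot(density(taxa_all_barplot$percent_abundance))  # the general distribution of the abundances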
We would greatly appreciate it if you could complete the POST-VTP Survey: https://forms.gle/ogTzjPW61S4WfVou6