Following https://doi.org/10.1038/s41586-020-2879-3

This guide is intended to teach you how to analyze a single Drosophila scRNA-seq transcriptome sample. You will start with the raw counts and end by making plots that show the prevalence of genes and cell types present in your sample. This project assumes you have a sample available and that you have a machine with 32 GB of available RAM.

N.B. We have chosen to have you start by analyzing an individual sample through Clustering, PCA, tSNE, and Biomarker and Cell-type Assignment. When conducting an analysis in real life, individual samples are often QC’d and then merged prior to downstream analysis. If there is time and interest, we can do a merged analysis once you finish analyzing your assigned individual sample.

Overall Approach:

The analysis of your individual sample has 12 parts:

(Optional) In the Linux Terminal:
1. SSH into your AWS HPC instance
2. Navigate to fastqs
3. Look at your fastqs to understand their structure. (We are not going to take you through alignment and count generation, as genomics cores typically do this for you. Please ask any questions you have about this in the group meeting.)

In RStudio:
4. Load Packages + Ingest Data
5. Basic quality control analysis
6. Data filtering/normalization
7. Scaling
8. Principal Component Analysis
9. tSNE Clustering
10. Identifying biomarkers for cell types
11. Generating dot plots
12. Assigning cell types

(4) Load Packages + Ingest Data

Load the necessary R packages:

Read in your .csv data file:

Please note: Adult_1d_F.csv is a placeholder. You should use the file for your assigned sample. Your assigned sample can be found in the gSheet.

Next, we’ll convert this numeric matrix into a sparse dgCMatrix. This is common practice because it uses a smaller storage footprint, which is especially important when working with large data (e.g. completing a merged analysis on all samples). It does this by storing only the non-zero values; zeros are not stored and are displayed as “.” when the matrix is printed. However, downstream analysis will also work on the normal numeric matrix.
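As a minimal sketch of the conversion described above, the following uses a small simulated matrix in place of your sample. The object names (dense_counts, sparse_counts) and the simulated data are assumptions for illustration only.

```r
library(Matrix)  # sparse matrix classes

set.seed(1)
# Simulated stand-in for the count table: 50 genes (rows) x 20 cells (columns).
dense_counts <- matrix(
  as.numeric(rpois(50 * 20, lambda = 0.3)),
  nrow = 50,
  dimnames = list(paste0("gene", 1:50), paste0("cell", 1:20))
)

# Convert to a sparse matrix; only the non-zero entries are stored.
sparse_counts <- Matrix(dense_counts, sparse = TRUE)

class(sparse_counts)      # typically "dgCMatrix" for numeric data
sparse_counts[1:5, 1:5]   # zeros are printed as "."
length(sparse_counts@x)   # number of stored (non-zero) values
```

On a matrix this small the memory saving is negligible; the benefit grows with the number of cells and the fraction of zeros.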

(5) Basic quality control analysis

In this step, you will gauge library saturation and generate knee plots.

Testing for library saturation

Now that you have the transcript counts of different genes in your sample, you should gauge the library saturation of your sample. To do this, you will sum the counts across all genes for each cell, count the number of genes detected in each cell, and collect these per-cell values in a table with the tibble() function. You can then visualize library saturation by plotting the number of genes detected against the total count (i.e. transcript molecule number) for each cell.
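The per-cell summaries can be sketched in base R as follows. The data are simulated, and a plain data.frame is used here where the guide uses tibble(); the column names nCounts and nGenes follow the labels used in this section.

```r
set.seed(2)
# Simulated counts: 200 genes (rows) x 30 cells (columns).
counts <- matrix(rpois(200 * 30, lambda = 0.5), nrow = 200,
                 dimnames = list(paste0("gene", 1:200), paste0("cell", 1:30)))

per_cell <- data.frame(
  barcode = colnames(counts),
  nCounts = colSums(counts),      # total transcript molecules per cell
  nGenes  = colSums(counts > 0)   # genes with at least one count per cell
)
head(per_cell)

# Genes detected against total counts; a small alpha reduces overplotting.
plot(per_cell$nCounts, per_cell$nGenes,
     col = adjustcolor("black", alpha.f = 0.3), pch = 16,
     xlab = "nCounts", ylab = "nGenes")
```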

This plot can be misleading: even with a small alpha, it cannot accurately show how many points are stacked at one location. Binning the points gives a better representation of these data.

The “count” label in the legend here refers to the number of cells that have a given combination of nCounts and nGenes.

At this point, you need to filter out empty or near-empty droplets that have few or no reads in them. To do this, you start by ranking all the barcodes by their total count.

Examine the knee plot

The “knee plot” is a standard single-cell RNA-seq quality control plot that is used to determine a threshold for considering cells valid for analysis in an experiment. To make the plot, cells are ordered on the x-axis according to the number of distinct UMIs observed. The y-axis displays cell rank by the number of distinct UMIs for each barcode (here barcodes are proxies for cells). High quality barcodes are located on the right-hand side of the plot, and thresholding is performed by identifying the “knee” on the curve.
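A rough base R sketch of how such a plot can be built is shown here. The helper name knee_data and the simulated droplet totals are assumptions for illustration; this is not the knee plot function used in the guide.

```r
set.seed(3)
# Simulated totals: 200 cell-containing droplets and 800 near-empty droplets.
total_umis <- c(rpois(200, lambda = 3000), rpois(800, lambda = 20))
names(total_umis) <- paste0("barcode", seq_along(total_umis))

# Rank barcodes from the highest total UMI count (rank 1) downwards.
knee_data <- function(totals) {
  ordered <- sort(totals, decreasing = TRUE)
  data.frame(barcode = names(ordered),
             total   = as.numeric(ordered),
             rank    = seq_along(ordered))
}

kd <- knee_data(total_umis)
kd <- kd[kd$total > 0, ]  # log axes cannot display zeros

# Total UMIs on the x-axis, barcode rank on the y-axis, both log-scaled.
plot(kd$total, kd$rank, log = "xy", type = "l",
     xlab = "Total UMI count", ylab = "Barcode rank")
```

In the simulated data the curve drops sharply between the two groups of droplets, which is the “knee” used for thresholding.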

Create Knee Plot Function

Generate Knee Plot

Ideally one droplet contains one cell, one bead (which has thousands of attached oligos, each with a different UMI, but the same barcode, adapter sequence, and polyT tail), and reverse transcriptase. Once the RT happens, each cDNA from that one cell has a distinct UMI with the same barcode. An empty droplet does not contain a cell but will still contain “ambient” RNA (i.e. cell-free transcripts in the solution in which the cells are suspended). To avoid such transcripts in our data, we have to remove the (near) empty droplets/cells.

(6) Filter and normalize data

Filtering empty droplets

This step will typically filter out your empty droplets based on the inflection point of the knee plot above. If a droplet’s counts are above the inflection point (metadata(bc_rank)$inflection == ##), the droplet is kept; if its counts are below the inflection point, the droplet is discarded.

Here we are showing you how to filter using the inflection point, but have commented out the code because we don’t want you to run it.

Take a look at the current dimensions of the data.batch1 object:

## [1] 12028  5948

Now perform filtering:

This allows you to filter out any drops to the left of the inflection point:

The Desplan lab has pre-filtered some of this raw data so we’ve commented out this code, but wanted to show you how it would be done.

Here we remove any genes that have zero UMIs:
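The two filters described above can be sketched in base R as follows. The threshold value and the simulated matrix are hypothetical; in the guide, the threshold comes from metadata(bc_rank)$inflection.

```r
set.seed(4)
# Simulated counts: 100 genes (rows) x 40 droplets (columns).
counts <- matrix(rpois(100 * 40, lambda = 0.2), nrow = 100,
                 dimnames = list(paste0("gene", 1:100), paste0("cell", 1:40)))
counts[1:5, ] <- 0   # force a few genes to have zero UMIs

inflection <- 15     # hypothetical threshold, for illustration only

dim(counts)          # dimensions before filtering

# Keep droplets whose total counts are above the inflection point.
filtered <- counts[, colSums(counts) > inflection, drop = FALSE]

# Remove genes that have zero UMIs across the remaining droplets.
filtered <- filtered[rowSums(filtered) > 0, , drop = FALSE]

dim(filtered)        # dimensions after filtering
```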

Take a look at the new dimensions of YourMatrixObject after filtering:

## [1] 10100  5948

Note how many genes and cells were filtered out of your sample.

Checkpoint (CP) 1: Please upload your results up to this point in a SINGLE POST on the group Slack channel. Feel free to concisely explain your results and to ask clarifying questions.

Normalizing

First, initialize a ‘Seurat object’ with the raw, non-normalized data. A Seurat object is a data object that effectively stores and permits facile manipulation of single cell data. Here you will import YourMatrixObject into the Seurat object, determine what percentage of read counts from each cell are mitochondrial, and then plot various features of your Seurat object (nFeature_RNA, nCount_RNA, and percent.mt), which we describe below.

nFeature_RNA is the number of genes detected in each cell.
nCount_RNA is the total number of transcript molecules detected within each cell.
percent.mt is the percent of counts in the cell that align to mitochondrial genes.
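To make these three definitions concrete, here is a minimal base R sketch that computes them by hand on simulated counts. The “mt:” gene-name prefix used to pick out mitochondrial genes is an assumption; check how mitochondrial genes are named in your own matrix.

```r
set.seed(5)
# Simulated gene names; the "mt:" prefix for mitochondrial genes is an assumption.
genes <- c(paste0("gene", 1:95), paste0("mt:gene", 1:5))
counts <- matrix(rpois(100 * 25, lambda = 1), nrow = 100,
                 dimnames = list(genes, paste0("cell", 1:25)))

is_mito <- grepl("^mt:", rownames(counts))

qc <- data.frame(
  nCount_RNA   = colSums(counts),       # total molecules per cell
  nFeature_RNA = colSums(counts > 0),   # genes detected per cell
  percent.mt   = 100 * colSums(counts[is_mito, , drop = FALSE]) / colSums(counts)
)
head(qc)
```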

Low nFeature_RNA can mean that your cell is dead/dying, that the cell membrane is “holey” and leaking mRNA, or that the droplet is empty. High nCount_RNA or high nFeature_RNA can mean that you have a doublet. Thresholding these parameters is an important part of data filtering: it helps you remove empty droplets, dead/dying/unhealthy cells, and doublet droplets from your data.

One can also filter out genes based on the number of cells they occur in (e.g., min.cells = 3 keeps genes that occur in at least 3 cells) and filter out cells based on the minimum number of genes they must have to be included (e.g., min.features = 200 keeps cells that have at least 200 genes expressed).
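The logic of these two thresholds can be sketched on a simulated matrix as follows. The small threshold values are chosen to suit the toy data, and the order of the two filters (cells first, then genes) is an assumption of this sketch.

```r
set.seed(6)
# Simulated sparse counts: 300 genes (rows) x 50 cells (columns).
counts <- matrix(rpois(300 * 50, lambda = 0.05), nrow = 300,
                 dimnames = list(paste0("gene", 1:300), paste0("cell", 1:50)))

min_features <- 10  # toy value; the text above uses 200 as its example
min_cells    <- 3

# Keep cells that express at least min_features genes.
keep_cells <- colSums(counts > 0) >= min_features
counts_f <- counts[, keep_cells, drop = FALSE]

# Keep genes that are detected in at least min_cells of the remaining cells.
keep_genes <- rowSums(counts_f > 0) >= min_cells
counts_f <- counts_f[keep_genes, , drop = FALSE]

dim(counts)
dim(counts_f)
```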

Create your Seurat object.

##                    orig.ident nCount_RNA nFeature_RNA percent.mt
## AAACCTGAGGGTTTCT_7 Adult_1d_F       1071          629   4.014939
## AAACCTGAGTCCAGGA_7 Adult_1d_F       1576          808   3.680203
## AAACCTGCAAAGGAAG_7 Adult_1d_F       3925         1649   2.904459
## AAACCTGCACCTGGTG_7 Adult_1d_F       1025          618   3.219512
## AAACCTGGTATCGCAT_7 Adult_1d_F       2986         1343   3.750837
## AAACCTGGTGCACTTA_7 Adult_1d_F       2224         1044   4.901079
## AAACCTGGTGTCTGAT_7 Adult_1d_F       2651         1189   2.640513
## AAACCTGTCATACGGT_7 Adult_1d_F       1487          777   6.388702
## AAACCTGTCCGGGTGT_7 Adult_1d_F       1827          843   4.159825
## AAACCTGTCCGTTGTC_7 Adult_1d_F       4370         1440   3.386728

FeatureScatter is typically used to visualize feature-feature relationships, but can be used for anything calculated by the Seurat object (i.e. columns in object metadata, PC scores etc.).

We expect to see a strong relationship between the number of genes (nFeature_RNA) and the number of molecules (nCount_RNA). However, the plot of nCount_RNA against percent.mt will likely show some cells that have a high percentage of mitochondrial genes but low count numbers. These cells are likely dead/dying and need to be filtered out.

CP 2: Please upload your results up to this point in a SINGLE POST on the group Slack channel. Feel free to concisely explain your results and to ask clarifying questions.

Filter data using mitochondrial percentage and UMI counts:

Normalize your counts.
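One widely used scheme is log-normalization: divide each cell’s counts by that cell’s total, multiply by a scale factor, and take log(1 + x). The sketch below assumes this scheme with a scale factor of 10,000 and uses simulated data; the normalization settings in your own analysis may differ.

```r
set.seed(7)
# Simulated counts: 50 genes (rows) x 10 cells (columns).
counts <- matrix(rpois(50 * 10, lambda = 2), nrow = 50,
                 dimnames = list(paste0("gene", 1:50), paste0("cell", 1:10)))

scale_factor <- 10000          # assumed scale factor
cell_totals  <- colSums(counts)

# Divide each column (cell) by its total, rescale, then apply log(1 + x).
normalized <- log1p(sweep(counts, 2, cell_totals, "/") * scale_factor)

normalized[1:5, 1:3]
```

After this step, cells with different sequencing depths are on a comparable scale, because every cell’s un-logged values sum to the same scale factor.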

Now that you’ve filtered your data, you can reliably identify the genes with the most variable expression in your sample. Identify the 10 most highly variable genes in your sample:
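To illustrate the idea of ranking genes by variability, the sketch below ranks simulated genes by their variance-to-mean ratio and keeps the top 10. This is a deliberately simple stand-in and is not the variable-feature method used by Seurat.

```r
set.seed(8)
# Simulated counts: 100 genes (rows) x 60 cells (columns).
counts <- matrix(rpois(100 * 60, lambda = 1), nrow = 100,
                 dimnames = list(paste0("gene", 1:100), paste0("cell", 1:60)))

# Make the first 10 genes highly expressed in half of the cells only.
counts[1:10, 1:30] <- counts[1:10, 1:30] + rpois(10 * 30, lambda = 8)

gene_mean  <- rowMeans(counts)
gene_var   <- apply(counts, 1, var)
dispersion <- gene_var / gene_mean   # variance-to-mean ratio per gene

# Names of the 10 genes with the largest dispersion.
top10 <- names(sort(dispersion, decreasing = TRUE))[1:10]
top10
```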

CP 3: Please upload your results up to this point in a SINGLE POST on the group Slack channel. Feel free to concisely explain your results and to ask clarifying questions.

(7) Scaling the data

Next, we apply a linear transformation (‘scaling’) that is a standard pre-processing step prior to dimensional reduction techniques like PCA. The ScaleData function shifts the expression of each gene, so that the mean expression across cells is 0 and the variance across cells is 1.

This step gives equal weight to genes in downstream analyses, so that highly-expressed genes do not dominate. The results of this are stored in dataset[["RNA"]]@scale.data. By default, ScaleData applies this only to the genes identified as highly variable.
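The transformation itself is a per-gene z-score. A minimal base R sketch on simulated normalized values (not the ScaleData implementation) looks like this:

```r
set.seed(9)
# Simulated normalized expression: 20 genes (rows) x 15 cells (columns).
normalized <- matrix(rnorm(20 * 15, mean = 2, sd = 3), nrow = 20,
                     dimnames = list(paste0("gene", 1:20), paste0("cell", 1:15)))

# scale() works on columns, so transpose to scale each gene, then transpose back.
scaled <- t(scale(t(normalized)))

round(rowMeans(scaled), 10)   # each gene now has mean 0 across cells
apply(scaled, 1, sd)          # and standard deviation (hence variance) 1
```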

Whether you scale only the highly variable genes or all genes does not affect Principal Component Analysis (PCA) or clustering results. However, Seurat heatmaps (produced as shown below with DoHeatmap) require genes in the heatmap to be scaled so that highly-expressed genes don’t dominate. To make sure we don’t leave any genes out of the heatmap later, we are scaling all genes in this project. We can also use the ScaleData function to remove unwanted sources of variation from a single-cell dataset. For example, we could ‘regress out’ heterogeneity associated with cell cycle stage or mitochondrial contamination.

(8) Principal Component Analysis

Determining dimensionality

To overcome the extensive technical noise in any single feature for scRNA-seq data, one can cluster cells based on their PCA projections, with each PC essentially representing a ‘metafeature’ that combines information across a correlated feature set.

A common heuristic method generates an ‘Elbow plot’: a ranking of principal components based on the percentage of variance explained by each PC (ElbowPlot() function).
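The quantity behind such a plot can be sketched with base R’s prcomp() on simulated data. The three latent factors and all object names are assumptions for illustration; this is not the ElbowPlot() implementation.

```r
set.seed(10)
# Simulated expression: 100 cells (rows) x 30 genes (columns), driven by 3 factors.
latent   <- matrix(rnorm(100 * 3), nrow = 100)
loadings <- matrix(rnorm(3 * 30), nrow = 3)
expr     <- (latent %*% loadings) * 3 + matrix(rnorm(100 * 30), nrow = 100)

# prcomp expects observations (cells) in rows; center and scale each gene.
pca <- prcomp(expr, center = TRUE, scale. = TRUE)

# Percentage of total variance explained by each PC.
pct_var <- 100 * pca$sdev^2 / sum(pca$sdev^2)

plot(seq_along(pct_var), pct_var, type = "b",
     xlab = "PC", ylab = "% variance explained")
```

With three simulated factors, the curve should flatten after roughly the third PC; that flattening is the “elbow”.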

In this example, we can observe an ‘elbow’ around PC16-17, suggesting that the majority of true signal is captured in the first 16-17 PCs.

Examine and visualize PCA results

## PC_ 1 
## Positive:  CR34335, BM-40-SPARC, Cyp28d1, CG31705, Cg25C 
## Negative:  cpx, Rdl, futsch, Ggamma30A, Cngl 
## PC_ 2 
## Positive:  CG10226, CG10550, CG8837, Tret1-1, CG6126 
## Negative:  CG8369, nrv2, Tsp5D, CG1552, Msr-110 
## PC_ 3 
## Positive:  CG34362, CG31221, acj6, CG42750, klg 
## Negative:  orb, CG14274, CG15522, ome, Eaat1 
## PC_ 4 
## Positive:  ST6Gal, CG15522, acj6, CG14340, SP1029 
## Negative:  Ald, CG10804, cpx, CG10186, CG13739 
## PC_ 5 
## Positive:  Tk, Pka-C3, CG17193, bsh, CG32647 
## Negative:  bru-3, jdp, CG45263, luna, sm

CP 4: Please upload your results up to this point in a SINGLE POST on the group Slack channel. Feel free to concisely explain your results and to ask clarifying questions.

NOTE: The following process can take a long time for big datasets, so we have commented it out. More approximate techniques such as those implemented in ElbowPlot() above can be used as an alternative to reduce the computation time.

The JackStraw procedure can help you figure out how many of the PCs are significant enough to include in your final clustering (note the p values in the label of the plot).

Similarly, as mentioned above, you can use ElbowPlot() to determine the number of PCs to use in your analysis; the PCs before the elbow will be more useful to include. In this project, we would like you to use both ElbowPlot() and JackStrawPlot() to gauge PC inclusion for downstream analysis.

Please upload your results up to this point on the group channel and explain your results in a single post.

(9) tSNE clustering

Next, you will find cell clusters and use tSNE to visualize them. These are groups of individual cells in your sample with statistically similar transcriptomic profiles. Clustering is often used to identify cell types, as individual cell types tend to have distinct transcriptomic profiles.

For a review of what cell types are and why they tend to have distinct transcriptomic profiles, check out Arendt et al., 2016 (DOI: 10.1038/nrg.2016.127).

Also, you can change the resolution of the FindClusters() command to produce different numbers of clusters. This might be useful if you’re looking at cell types that are particularly rare or if you suspect that there are more cell types based on the literature than what you initially find with the default resolution.

## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
## 
## Number of nodes: 5945
## Number of edges: 303224
## 
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.6574
## Number of communities: 86
## Elapsed time: 2 seconds

We have chosen the dims and resolution parameters based on the methods section of the paper. Iteration is often required to determine optimal parameters.

Look at cluster IDs of the first 5 cells:

## AAACCTGAGGGTTTCT_7 AAACCTGAGTCCAGGA_7 AAACCTGCAAAGGAAG_7 AAACCTGCACCTGGTG_7 
##                 13                 32                 11                 75 
## AAACCTGGTATCGCAT_7 
##                 52 
## 84 Levels: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 ... 83

If you look back at the plot we made using only the first two PCs, you aren’t able to see the clusters very well. tSNE takes into account more than just your first two PCs, so it lets you visualize the clusters better. UMAP is another way of doing this dimensional reduction to visualize cell clusters. We are happy to discuss the distinction between UMAP and tSNE in the group meetings if you would like.

Note: you can set label = TRUE or use the LabelClusters function to help label individual clusters

Generate TSNE:

CP 5: Please upload your results up to this point in a SINGLE POST on the group Slack channel. Feel free to concisely explain your results and to ask clarifying questions.

(10) Identifying biomarkers for cell types

Let’s find markers for every cluster compared to all remaining cells, reporting only the positive ones:
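For a single gene, the comparison of one cluster against all remaining cells can be sketched with a rank-sum test in base R. The simulated data, the pseudocount of 1 in the fold change, and the choice of test are assumptions of this sketch, not a description of the marker-finding code used in the guide.

```r
set.seed(11)
# 30 cells in cluster "0" and 70 cells in all other clusters.
cluster <- factor(rep(c("0", "rest"), times = c(30, 70)))
# Simulated expression of one gene that is enriched in cluster "0".
expr <- c(rpois(30, lambda = 10), rpois(70, lambda = 1))

in_cluster <- cluster == "0"

# Wilcoxon rank-sum test: cluster "0" versus all remaining cells.
test <- wilcox.test(expr[in_cluster], expr[!in_cluster], exact = FALSE)

# Log2 fold change of mean expression, with a pseudocount of 1.
log2fc <- log2((mean(expr[in_cluster]) + 1) / (mean(expr[!in_cluster]) + 1))

result <- data.frame(cluster = "0", p_value = test$p.value,
                     log2fc = log2fc, positive = log2fc > 0)
result
```

A “positive” marker is one with higher expression inside the cluster than outside it, which is what the final column records.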

You can plot the distribution of any gene by cluster as a violin plot:

N.B. These genes are placeholders. Feel free to investigate your genes of choice in addition to these:

You can plot raw counts as well:

CP 6: Please upload your results up to this point in a SINGLE POST on the group Slack channel. Feel free to concisely explain your results and to ask clarifying questions.

Let’s plot the expression of a gene on our TSNE plot. Here we use Tim17b, a receptor for the inhibitory neurotransmitter GABA. Look at how this receptor is distributed across clusters.

Try typing in a gene of interest to visualize its expression.

Now let’s plot expression of multiple genes on our TSNE plot. Again, these are just example genes and can be replaced with the genes of your choice:

Get top 10 marker genes from each cluster:
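Selecting the top markers per cluster amounts to sorting within each cluster and keeping the first rows. The sketch below uses a small made-up marker table and takes the top 2 rather than the top 10; the column names (cluster, gene, avg_log2FC) and the helper top_n_per_cluster are assumptions for illustration.

```r
# Made-up marker table; column names are assumed for illustration.
markers <- data.frame(
  cluster    = rep(c("0", "1"), each = 4),
  gene       = paste0("gene", 1:8),
  avg_log2FC = c(2.1, 0.4, 1.3, 3.0, 0.9, 2.5, 0.2, 1.7)
)

# Hypothetical helper: keep the n rows with the largest fold change per cluster.
top_n_per_cluster <- function(df, n) {
  pieces <- lapply(split(df, df$cluster), function(d) {
    head(d[order(d$avg_log2FC, decreasing = TRUE), ], n)
  })
  do.call(rbind, pieces)
}

top2 <- top_n_per_cluster(markers, 2)
top2
```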

Heatmap of normalized expression of top 10 marker genes from each cluster:

CP 7: Please upload your results up to this point in a SINGLE POST on the group Slack channel. Feel free to concisely explain your results and to ask clarifying questions.

(11) Make DotPlots of gene expression markers for each cluster:

Create a DotPlot with gene names listed in code:
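A dot plot usually encodes two numbers per gene and cluster: the percentage of cells expressing the gene (dot size) and the average expression (dot color). As a rough sketch, those two summaries can be computed in base R from simulated data as follows; geneA and geneB are hypothetical names, and this is not the DotPlot implementation.

```r
set.seed(12)
# 60 cells in three clusters of 20 cells each.
cluster <- factor(rep(c("0", "1", "2"), each = 20))

# Two simulated marker genes: geneA is high in cluster 0, geneB in cluster 1.
counts <- rbind(
  geneA = rpois(60, lambda = rep(c(5, 0.2, 0.2), each = 20)),
  geneB = rpois(60, lambda = rep(c(0.2, 5, 0.2), each = 20))
)

# Percentage of cells in each cluster with at least one count (genes x clusters).
pct_exp <- t(apply(counts, 1, function(g) tapply(g > 0, cluster, mean))) * 100

# Average count per cluster (genes x clusters).
avg_exp <- t(apply(counts, 1, function(g) tapply(g, cluster, mean)))

round(pct_exp, 1)
round(avg_exp, 2)
```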


(12) Assign cell types

You’ve just identified reliable marker genes for each of your clusters. Now, make a prediction for what cell type you think each cluster corresponds to. Feel free to re-label the Y-axis of your Dot Plot accordingly.
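Once you have a prediction for each cluster, relabeling is a lookup from cluster ID to cell-type name. The sketch below uses made-up cluster IDs and placeholder labels; substitute your own predictions.

```r
# Made-up cluster assignments for five cells.
cluster_ids <- factor(c("0", "2", "1", "0", "2"))

# Placeholder labels; replace these with your own cell-type predictions.
cell_type_map <- c("0" = "Hypothetical type A",
                   "1" = "Hypothetical type B",
                   "2" = "Hypothetical type C")

# Look up the label for each cell's cluster ID.
cell_type <- factor(unname(cell_type_map[as.character(cluster_ids)]))

table(cluster_ids, cell_type)
```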

CP 8: Please upload your results up to this point on the group channel and explain your results in a single post.