This guide is intended to teach you how to analyze a single drosophila scRNA-seq transcriptome sample. You will start with the raw counts and end by making plots that show the prevalence of genes and cell types present in your sample. This project assumes you have a sample available and that you have a machine with 32GB of available RAM.

N.B. We have chosen to have you start by analyzing an individual sample through Clustering, PCA, tSNE, and Biomarker and Cell-type Assignment. When conducting an analysis in real life, individual samples are often QC’d and then merged prior to downstream analysis. If there is time and interest, we can do a merged analysis once you finish analyzing your assigned individual sample.

Overall Approach:

The analysis of your individual sample has 11 parts:

(Optional) In the Linux Terminal:
1. SSH into your AWS HPC instance 2. Navigate to fastqs 3. Look at your fastqs to understand their structure. (We are not going to take you through alignment and count generation, as genomics cores typically do this for you. Please ask any questions you have about this in the group meeting.)

In RStudio:
4. Load Packages + Ingest Data 5. Basic quality control analysis
6. Data filtering/normalization
7. Scaling
8. Principal Component Analysis
9. tSNE Clustering
10. Identifying biomarkers for cell types
11. Generating dot plots
12. Assigning cell types

(4) Load Packages + Ingest Data

Load the necessary R packages:


(Option 1A) Read in your CellRanger matrix data:

Adult1_mtx <- Read10X("./data/matrix_count_data/Matrices/Adult1/Drosophila_ref_genome/")

Please note: Adult1 is a placeholder. You should use the file for your assigned sample. Your assigned sample can be found in the gSheet.

(Option 1B) Read in count .csv data

Adult1_mtx_initial <- read.csv(file = "./data/count_data/Adult1.csv", header = T, row.names = 1)

intermediate_matrix_1 <- data.table(Adult1_mtx_initial)
intermediate_matrix_2<-sparsify(intermediate_matrix_1, sparsifyNAs = FALSE, naCols = "none")

Adult1_mtx <- intermediate_matrix_2 


(5) Basic quality control analysis

In this step, you will gauge library saturation and generate knee plots.

Testing for library saturation

Now that you have the transcript counts of different genes in your sample, you should gauge the library saturation of your sample. To do this, you will sum the read counts across all genes for each cell, and then input this value into the tibble() function to calculate library saturation. You can visualize library saturation by plotting the gene number by read count (i.e. transcript molecule number) to gauge the relative frequency of read counts across your gene pool.

tot_counts <- colSums(Adult1_mtx)
lib_sat <- tibble(nCount = tot_counts,
                  nGene = colSums(Adult1_mtx> 0))

options(repr.plot.width=9, repr.plot.height=6)
ggplot(lib_sat, aes(nCount, nGene)) +
  geom_point(alpha = 0.1, size = 0.5) +
  scale_x_log10() + scale_y_log10() + annotation_logticks()

This plot is very misleading, as even the small alpha can’t accurately show how many points are stacked at one location, thus binning these points will allow us to better represent these data.

ggplot(lib_sat, aes(nCount, nGene)) +
  geom_bin2d(bins = 50) +
  scale_fill_scico(palette = "devon", direction = -1, end = 0.95) +
  scale_x_log10() + scale_y_log10() + annotation_logticks()

The “count” label in the legend here refers to the number of cells that have a given combination of nCounts and nGenes

At this point, you need to filter out empty or near empty droplets that have no reads in them. To do this, you start by ranking all the barcodes by the total count.

bc_rank <- barcodeRanks(Adult1_mtx, lower = 10)

Examine the knee plot

The “knee plot” is a standard single-cell RNA-seq quality control that is used to determine a threshold for considering cells valid for analysis in an experiment. To make the plot, cells are ordered on the x-axis according to the number of distinct UMIs observed. The y-axis displays cell rank by the number of distinct UMIs for each barcode (here barcodes are proxies for cells). High quality barcodes are located on the right hand side of the plot, and thresholding is performed by identifying the “knee” on the curve.

Create Knee Plot Function

knee_plot <- function(bc_rank) {
  knee_plt <- tibble(rank = bc_rank[["rank"]],
                    total = bc_rank[["total"]]) %>%
              distinct() %>%
              dplyr::filter(total > 0) # nothing has total > 0 here
  annot <- tibble(inflection = metadata(bc_rank)[["inflection"]],
                  rank_cutoff = max(bc_rank$rank[bc_rank$total > metadata(bc_rank)[["inflection"]]]))
  p <- ggplot(knee_plt, aes(total, rank)) +
    geom_line() +
    geom_hline(aes(yintercept = rank_cutoff), data = annot, linetype = 2) +
    geom_vline(aes(xintercept = inflection), data = annot, linetype = 2) +
    scale_x_log10() +
    scale_y_log10() +
    annotation_logticks() +
    labs(y = "Rank", x = "Total UMIs")

Generate Knee Plot

options(repr.plot.width=9, repr.plot.height=6)


Ideally one droplet contains one cell, one bead (which has thousands attached oligos, each with a different UMI, but same barcode, adapter sequence, and polyT tail), and reverse transcriptase. Once the RT happens, each cDNA from that one cell has a distinct UMI with the same barcode. An empty droplet does not contain a cell but will still contain “ambient” RNA (i.e. cell-free transcripts in the solution in which the cells are suspended. To avoid such transcripts in our data we have to remove the (near) empty droplets/cells

(6) Filter and normalize data

Filtering empty droplets

This step will typically filter out your empty droplets based on the inflection point of the knee chart above. If its counts are above the inflection point (metadata(bc_rank)$inflection== ##), the drop is kept; if its counts are below the inflection point, the drop is discarded.

Here we are showing you how to filter using the inflection point, but have commented out the code becuase we don’t want you to run it.

Take a look at the current dimensions of the data.batch1 object:


Now perform filtering:

This allows you to filter out any drops to the left of the inflection point:

#Adult1_mtx <- Adult1_mtx[, tot_counts > metadata(bc_rank)$inflection]

The Desplan lab has pre-filtered some of this raw data so we’ve commented out this code, but wanted to show you how it would be done.

Here we remove any genes that have zero UMIs:

Adult1_mtx <- Adult1_mtx[Matrix::rowSums(Adult1_mtx) > 0, ]

Take a look at the new dimensions of the YourMatrixObject, after filtering:


Note how many genes and cells were filtered out of your sample.

Checkpoint (CP) 1: Please upload your results up to this point in a SINGLE POST on the group Slack channel. Feel free to concisely explain your results and to ask clarifying questions.


First, initialize a ‘Seurat object’ with the raw, non-normalized data. A Seurat object is a data object that effectively stores and permits facile manipulation of single cell data. Here you will import YourMatrixObject into the Seurat object, determine what percentage of read counts from each cell are mitocondrial, and then plot various features of your Seurat object (nFeature_RNA, nCount_RNA, and, which we describe below.

nFeature_RNA is the number of genes detected in each cell.
nCount_RNA is the total number of transcript molecules detected within each cell. is the percent of counts in the cell that align to mitochondrial genes.

Low nFeature_RNA can mean that your cell is dead/dying, that the cell membrane is “holey” and leaking mRNA, or that the droplet is empty. High nCount_RNA or high nFeature_RNA can mean that you have a doublet. Thresholding these paraeters can help you to remove empty droplets, dead/dying/unhealthy cells, or doublet droplets from your data and are important in data filtering.

One can also filter out genes based on the number of cells they occur in (e.g., min.cell=3 for genes that occur in at least 3 cells) and the minimum number of genes a cell must have to be included (i.e. min.features = 200 for cells that have at least 200 genes ecpressed).

Create your Seurat object.

# You can name your Seurat object whatever you like, but suggest naming it after your assigned sample ID.

Adult1_seurat <- CreateSeuratObject(counts = Adult1_mtx, project = "Adult1", min.cells = 3, min.features = 200) 

# Tell Seurat which genomic features are mitochondrial genes: 
Adult1_seurat[[""]] <- PercentageFeatureSet(Adult1_seurat, pattern = "^mt.") 


# Look at the 
VlnPlot(Adult1_seurat, features = c("nFeature_RNA", "nCount_RNA", ""), ncol = 3)

FeatureScatter is typically used to visualize feature-feature relationships, but can be used for anything calculated by the Seurat object (i.e. columns in object metadata, PC scores etc.).

We expect to see a strong relationship between the number of genes (nFeature_RNA) and the number of molecules (nCount_RNA). However, the nCount_RNA and the will likely have some cells that have a high percent of mitocondrial genes but low count numbers, which are the cells that are likely dead/dying and need to be filtered out.

plot1 <- FeatureScatter(Adult1_seurat, feature1 = "nCount_RNA", feature2 = "")

plot2 <- FeatureScatter(Adult1_seurat, feature1 = "nCount_RNA", feature2 = "nFeature_RNA")

CombinePlots(plots = list(plot1, plot2))

CP 2: Please upload your results up to this point in a SINGLE POST on the group Slack channel. Feel free to concisely explain your results and to ask clarifying questions.

filter data using mitochondrial percentage and UMI counts

# these 
Adult1_seurat <- subset(Adult1_seurat, subset = < 10 & nCount_RNA > 1000 & nCount_RNA < 20000)  

Normalize your counts.

Adult1_seurat <- NormalizeData(Adult1_seurat, normalization.method = "LogNormalize", scale.factor = 10000)
Adult1_seurat <- FindVariableFeatures(Adult1_seurat, selection.method = "vst", nfeatures = 2000)

Now that you’ve filtered your data, you can reliably identify the genes with the most variable expression in your sample. Identify the 10 most highly variable genes in your sample:

top10 <- head(VariableFeatures(Adult1_seurat), 10)

# Plot variable features with and without labels

plot1 <- VariableFeaturePlot(Adult1_seurat, log = FALSE)

LabelPoints(plot = plot1, points = top10, repel = TRUE)

CP 3: Please upload your results up to this point in a SINGLE POST on the group Slack channel. Feel free to concisely explain your results and to ask clarifying questions.

(7) Scaling the data

Next, we apply a linear transformation (‘scaling’) that is a standard pre-processing step prior to dimensional reduction techniques like PCA. The ScaleData function shifts the expression of each gene, so that the mean expression across cells is 0 and the variance across cells is 1.

This step gives equal weight to genes in downstream analyses, so that highly-expressed genes do not dominate. The results of this are stored in dataset[[“RNA”]] We apply this only to the genes identified as highly variable, which is default function.

Adult1_seurat <- ScaleData(Adult1_seurat)

all.genes <- rownames(Adult1_seurat)

Adult1_seurat <- ScaleData(Adult1_seurat, features = all.genes)

The scaling does not affect Princeipal Component Analysis (PCA) or clustering results. However, Seurat heatmaps (produced as shown below with DoHeatmap) require genes in the heatmap to be scaled so that highly-expressed genes don’t dominate. To make sure we don’t leave any genes out of the heatmap later, we are scaling all genes in this project. We can also use the ScaleData function to remove unwanted sources of variation from a single-cell dataset. For example, we could ‘regress out’ heterogeneity associated with cell cycle stage, or mitochondrial contamination.

(8) Principal Component Analysis

Determining dimensionality

To overcome the extensive technical noise in any single feature for scRNA-seq data, one can cluster cells based on their PCA projections, with each PC essentially representing a ‘metafeature’ that combines information across a correlated feature set.

A common heuristic method generates an ‘Elbow plot’: a ranking of principle components based on the percentage of variance explained by each PC (ElbowPlot() function).

Adult1_seurat <- RunPCA(Adult1_seurat, features = VariableFeatures(object = Adult1_seurat), npcs = 120)

ElbowPlot(Adult1_seurat, ndims = 120)

In this example, we can observe an ‘elbow’ around PC16-17, suggesting that the majority of true signal is captured in those PCs.

Examine and visualize PCA results

print(Adult1_seurat[["pca"]], dims = 1:5, nfeatures = 5)

VizDimLoadings(Adult1_seurat, dims = 1:2, reduction = "pca")

DimPlot(Adult1_seurat, reduction = "pca")

DimHeatmap(Adult1_seurat, dims = 1, cells = 500, balanced = TRUE)

DimHeatmap(Adult1_seurat, dims = 1:15, cells = 500, balanced = TRUE)

FeaturePlot(Adult1_seurat, reduction = "pca", feature = "Tim17b")

CP 4: Please upload your results up to this point in a SINGLE POST on the group Slack channel. Feel free to concisely explain your results and to ask clarifying questions.

NOTE: The following process can take a long time for big datasets s we have commented it out. More approximate techniques such as those implemented in ElbowPlot() above can be used as an alternative to reduce the computation time.

The JackStraw can help you figure out how many of the PCs are signifigant to include in your final clustering (note the p values in the label of the plot).

#Adult1_seurat <- JackStraw(Adult1_seurat, num.replicate = 100)

#Adult1_seurat <- ScoreJackStraw(Adult1_seurat, dims = 1:20)

#JackStrawPlot(Adult1_seurat, dims = 1:20)  # plotting the JackStraw

#dims would need to be changed according to desicred PC inclusion number

Similarly, as mentioned above, you can use #ElbowPlot() to determine the number of PCs to use in your analysis, the PCs before the elbow will be more useful to include in your analysis. In this project, we would like you to use both the ElbowPlot() and JackStrawPlot() to gauge PC inclusion for downstream analysis.

Please upload your results up to this point on the group channel explain your results in a single post.

(9) tSNE clustering

Next, you will use tSNE clustering to find cell clusters. These are groups of individual cells in your sample with statistically similar transcriptomic profiles. Clustering is often used to identify cell types, as individual cell types tend to have distinct transcriptomic profiles.

For a review of what cell types are and why they tend to have distinct transcriptomic profiles, check out Arendt et al., 2016 (DOI: 10.1038/nrg.2016.127).

Also, you can change the resolution of the FindClusters() command to produce different numbers of clusters. This might be useful if you’re looking at cell types that are particularly rare or if you suspect that there are more cell types based on the literature than what you initially find with the default resolution.

Adult1_seurat <- FindNeighbors(Adult1_seurat, dims = 1:120)

Adult1_seurat <- FindClusters(Adult1_seurat, resolution = 10) 

We have chosen the dims and resolution parameters based on the methods section of the paper. Iteration is often require to determine optimal parameters.

Look at cluster IDs of the first 5 cells:

head(Idents(Adult1_seurat), 5) 

If you look back at the plot we made using only the first two PCs, you aren’t able to see the clusters very well but using tSNE, which will take into account more than just your first two PCs, you can better visualize the clusters. UMAP is another way of doing this dimensional reduction to visualize cell clusters. We are happy to discuss the distinction between UMAP and TSNE in the group meetings if you would like.

Note: you can set label = TRUE or use the LabelClusters function to help label individual clusters

Generate TSNE:

Adult1_seurat<-RunTSNE(Adult1_seurat, dims = 1:120)

TSNEPlot(Adult1_seurat, label = TRUE) + xlim(-50,50) + ylim(-50,50)

CP 5: Please upload your results up to this point in a SINGLE POST on the group Slack channel. Feel free to concisely explain your results and to ask clarifying questions.

(10) Identifying biomarkers for cell types

Let’s find markers for every cluster compared to all remaining cells, reporting only the positive ones:

Adult1_seurat.markers <- FindAllMarkers(Adult1_seurat, only.pos = TRUE, min.pct = 0.1, logfc.threshold = 0.25)

Adult1_seurat.markers_table<-data.frame(Adult1_seurat.markers %>% group_by(cluster) %>% top_n(n = 15, wt = avg_logFC))

#change 'n =' parameter to choose how many markers to include in your table

You can plot the the violin plot distribution of any gene by cluster:

VlnPlot(Adult1_seurat, features = c("Tim17b", "Wnt4"))

N.B. these genes are placeholders, feel free investigate your genes of choice in addiation to these:

You can plot raw counts as well:

VlnPlot(Adult1_seurat, features = c("Tim17b", "Wnt4"), slot = "counts", log = TRUE)

CP 6: Please upload your results up to this point in a SINGLE POST on the group Slack channel. Feel free to concisely explain your results and to ask clarifying questions.

Let’s plot the expression of a gene on our TSNE plot. Here we use Tim17b, a receptor for the inhibitory neurotransmitter GABA. Look at how this receptor is distributed across clusters.

FeaturePlot(Adult1_seurat, features = c("Tim17b"))

Try typing in a gene of interest to visualize it’s expression.

Now let’s plot expression of multiple genes on our TSNE plot. Again, these are just example genes and can be replaced with the genes of your choice:

FeaturePlot(Adult1_seurat, features = c("Gs2", "CG14989", "kn", "sosie"))

Get top 10 marker genes from each cluster:

top10_markers <- Adult1_seurat.markers_table %>% group_by(cluster) %>% top_n(n = 10, wt = avg_logFC)

Heatmap of normalized expression of top 10 marker genes from each cluster:

DoHeatmap(Adult1_seurat, features = top10_markers$gene) + NoLegend()

CP 7: Please upload your results up to this point in a SINGLE POST on the group Slack channel. Feel free to concisely explain your results and to ask clarifying questions.

(11) Make DotPlots of gene expression markers for each cluster:

Create a DotPlot with gene names listed in code:

DotPlot(Adult1_seurat, features = c("Gs2", "CG14989", "kn", "sosie")) + RotatedAxis()

Create a DotPlot with gene names listed in code:

top_marker_each_cluster <- Adult1_seurat.markers %>% group_by(cluster) %>% top_n(n = 1, wt = avg_logFC)

top_marker_list <- unique(top_marker_each_cluster$gene)

DotPlot(Adult1_seurat, features = rev(top_marker_list)) + RotatedAxis()

(12) Assign cell types

You’ve just identified reliable marker genes for each of your clusters. Normally, you would make a prediction for what cell type you think each cluster corresponds to and re-label the Y-axis of your Dot Plot accordingly.

Because we are dealing with so many clusters,it is very challenging to use the makers alone to assign clusters on an ad hoc basis.

Bulk data

Here we use bulk rna-seq datasets that have been generated from FACS sorted cell popultions.

Ingest the data:

Katarina_data = read.csv("./data/Full_katarina_data_converted.csv", row.names = 1)
Davis_data = read.csv("./data/Full_davis_data_converted.csv", row.names = 1)

Generate log-normalized expression values of the bulk RNAseq datasets:

Seurat_D_data = CreateSeuratObject(counts = Davis_data)
Seurat_D_data = NormalizeData(Seurat_D_data)

Seurat_K_data = CreateSeuratObject(counts = Katarina_data)
Seurat_K_data = NormalizeData(Seurat_K_data)

Generate an average expression value for each gene in each cluster

Average_Expression_Adult1_seurat = AverageExpression(Adult1_seurat, assay = 'RNA', return.seurat = T)

Generate list of top 10 genes per cluster for pearson correlation to bul RNAseq data

comparison_genes = top10_markers$gene

correlation with Katarina data

#RAW data:
genes_in_common_K = intersect(rownames(Katarina_data), comparison_genes)
toto_K = cor(as.matrix([email protected][email protected])[genes_in_common_K,],
             as.matrix([email protected][email protected])[genes_in_common_K,])


     xlab = "Index of the clusters",
     ylab = "Correlation with Repo transcriptome")

     xlab = "Index of the clusters",
     ylab = "Correlation with Elav_K transcriptome")

bla = toto_K[order(-toto_K[,"Repo_K"]),"Repo_K"]
bla = bla[bla > 0.50]

correlation with Davis data

#RAW data:
genes_in_common_D = intersect(rownames(Davis_data), comparison_genes)
toto_D = cor(as.matrix([email protected][email protected])[genes_in_common_D,],
             as.matrix([email protected][email protected])[genes_in_common_D,])


# Martina, here's the code for this plot:
       xlab = "Index of the clusters",
       ylab = "Correlation with C2 transcriptome",
       main = "C2")

bla2 = toto_K[order(-toto_K[,"Repo_K"]),"Repo_K"]
bla2 = bla2[bla2 > 0.50]

CP 8 Please upload your cell types results to this point on the group channel and explain your results.