Please complete the Pre-VTP Survey.
In this Image Analysis + Neuroanatomy Virtual Training Project (VTP), you will use the image analysis software ImageJ and the Python programming language to investigate the evolution of brain anatomy in the fish species Astyanax mexicanus, with the goal of understanding how changes in the size and shape of brain regions correlate with their functions.
Throughout this process, you will learn about image analysis, segmentation, data visualization, basic statistics, and clustering.
How are the computational/data science skills learned in this project useful?
In case you are thinking, "How could I use this knowledge in real life?", here are several real-world applications of the techniques you will learn:
Research Applications: you can ask interesting questions about fundamental biology that impact our understanding of... The research study that the data you'll analyze in this VTP comes from is an example of this application.
Clinical Applications:
Biotech/Industry Applications:
Tech Applications:
In a broader sense, the skills you'll learn are a subdivision of data science. We'll point out data science principles as we go along.
The data you will analyze in this VTP comes from the research study "A brain-wide analysis maps structural evolution to distinct anatomical modules," which explores the evolution of brain anatomy in the blind Mexican cavefish. The study uses advanced imaging techniques to analyze variation in the shape and volume of brain regions and to investigate the influence of genetics on brain-wide anatomical evolution.
The brain's overall topology remains highly conserved across vertebrate lineages, but individual brain regions can vary significantly in size and shape.
Novel anatomical changes in the brain are built upon ancestral anatomy, leading to the development of new regions that expand the brain's functional repertoire.
There are two main ideas about how the brain evolves:
The first hypothesis suggests that different parts of the brain tend to evolve together. This means that selection acts on mechanisms that control the growth of all regions of the brain at the same time.
The second hypothesis proposes that selection can also act on individual brain regions. According to this idea, regions of the brain that have similar functions will change together in their anatomy, even if other brain regions are not affected.
Anatomical variation in the brain is governed by both volume (size) and shape, but scientists do not yet know whether the same or distinct mechanisms control these two parameters, and the relationship between them is poorly understood.
Most studies focus on either volume or shape separately; few compare both together.
The organismal systems researchers have used to study how volume and shape influence brain evolution face challenges, such as a lack of genetic diversity or of experimental tools, which makes it difficult to investigate the fundamental principles of how the brain evolves.
The blind Mexican cavefish provides a powerful model for studying genetic variation's impact on brain-wide anatomical evolution due to its distinct surface and cave forms with high genetic diversity.
Hybrid offspring between surface and cave populations allow the exploration of genetic differences and the identification of genetic underpinnings of neuroanatomical evolution.
A brain-wide neuroanatomical atlas was generated for the cavefish, and computational tools were applied to analyze volume and shape changes in brain regions.
Associations between naturally occurring genetic variation and neuroanatomical phenotypes were studied in hybrid brains, revealing genetically-specified regulation of brain-wide anatomical evolution.
Brain regions exhibited covariation in both volume and shape, indicating shared developmental mechanisms causing dorsal contraction and ventral expansion.
Selection may be operating on simple developmental mechanisms that influence early patterning events, modulating the volume and shape of brain regions.
[Methods Overview]
Throughout this VTP, you will...
Here's a flowchart of the analysis you'll perform:
All bioinformatics tasks will be performed "in the cloud" on your own Amazon Web Services (AWS)-hosted high-performance compute instance.
What is cloud computing?
For this project, you will execute tasks using either Fiji or Python on your cloud-based high-performance compute instance through your browser.
Fiji
Fiji is an open-source software package widely used for scientific image analysis and processing. It provides a user-friendly interface and numerous tools for tasks like image filtering, segmentation, and measurement. Fiji is based on another software package called ImageJ and includes many plugins and extensions that enhance its functionality. Fiji is a recursive acronym for "Fiji Is Just ImageJ".
Here's what Fiji will look like using the URL we've provided you:
Python
Python is a popular programming language known for its simplicity and versatility. It is widely used in data science and scientific research and has extensive support for various scientific libraries and tools. Python provides a flexible and intuitive syntax, making it easier for scientists and researchers to write code for data analysis, machine learning, and image processing tasks.
You will run the Python code in a Jupyter Notebook. Jupyter Notebook is an interactive computing environment where users can create and share documents containing live code, visualizations, and explanatory text. It allows code to be executed in cells, enabling iterative development and easy testing. With support for multiple programming languages, Jupyter Notebook is widely used for data analysis, scientific research, and education. It also facilitates the creation of interactive visualizations and allows documents to be saved and shared in various formats, promoting collaboration and reproducible research.
Here is what the Python (Jupyter Notebook) dashboard looks like:
The URLs and passwords to access Fiji and Python can be found in the Getting Started section.
Observe, Replicate, and Apply: During this analysis, you'll first see each step performed on an Example Sample. Then, replicate that step with the same sample to confirm you achieve identical results. Finally, apply the procedure to your assigned sample(s), which you can find in the Getting Started section.
Focus on Understanding, not Coding Syntax: Some steps require executing code. However, our primary goal isn't to teach you programming languages but to ensure you grasp the fundamentals of inputs, parameters, outputs, and the interpretation of those outputs. For better comprehension, consider each step as a mathematical function (e.g., y = mx + b): no matter how complex a block of code appears, you are always inputting data and receiving output.
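To make this concrete, here is a tiny illustrative sketch in Python (the function and numbers are made up; it is not part of the analysis). Every step you'll run has this same shape: data in, result out.
# A step in the analysis is just a function: input goes in, output comes out
def line(x, m=2, b=1):
    return m * x + b  # y = mx + b

y = line(3)  # input: x = 3
print(y)     # output: 7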
Mathematics is a Tool, not a Barrier: Certain steps will call for the use of complex mathematical functions to manipulate data. We understand you may not have a deep understanding of the underlying mathematics; our focus is on understanding the purpose and outcome of each step at a high level. Comprehending the input, the general function of the step, the output, and how to interpret and use that output is what's crucial.
Here's a step-by-step tutorial on Image Segmentation and Quantification based on the video "FIJI for Quantification: Cell Segmentation" presented by Dr. Paul McMillan of the Biological Optical Microscopy Platform at the University of Melbourne.
Objective
To perform cell segmentation using FIJI/ImageJ: we will turn a fluorescence image into a segmented image in which each individual cell has been separately delineated.
Materials Needed
A computer with FIJI/ImageJ installed, and the fluorescence image file, located on the Desktop in a folder called Segmentation-Tutorial.
(Reminder: the URL to the cloud instance with FIJI can be found in Getting Started.)
First, watch Dr. McMillan's video, then perform the steps on your assigned Fiji instance.
Steps
In FIJI, navigate to "Open > Desktop > Segmentation-Tutorial > Cell.tif"
Duplicate the Image
You should now have two images, "Cell.tif" and the duplicate "Cell-1.tif".
Identify Each Cell on "Cell.tif"
(Notes: (1) in the video this setting is called Noise Tolerance, but in your version of Fiji it's called Prominence; (2) the exact point count you get will be similar to, but not exactly the same as, the one in the video, due to the different Fiji versions used.)
Save this as "Mask_1.tif".
Define the Area of All the Cells on the duplicated image "Cell-1.tif"
Save this as "Mask_2.tif".
Combine "Mask_1.tif" and "Mask_2.tif"
This starts to segment the individual cells within the image.
Clean Up the Image
Save this as "Mask_3.tif".
Analyze the Data
Remember, the specific values used in this tutorial (such as the threshold value of 388 and the filter value of 250 square pixels) are specific to the image analyzed here. When you're doing this with your own images, you might need to adjust these values based on your specific image and what you're trying to achieve. Don't be afraid to experiment and see what works best for your data!
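If you're curious how this same workflow looks in code, here is a minimal sketch in Python using the scikit-image library. This is purely illustrative and not part of the Fiji tutorial; apart from the tutorial's 388 and 250, the parameter values are guesses that would need tuning.
# Illustrative only: the tutorial's workflow, sketched with scikit-image.
# Assumes a single-channel grayscale image and that scikit-image is installed.
import numpy as np
from skimage import io, measure, segmentation
from skimage.feature import peak_local_max

img = io.imread('Cell.tif')  # load the fluorescence image

# "Identify Each Cell": one seed point per cell, like Fiji's Find Maxima.
# min_distance stands in for Prominence; its value here is a guess.
coords = peak_local_max(img, min_distance=10)
markers = np.zeros(img.shape, dtype=int)
markers[tuple(coords.T)] = np.arange(1, len(coords) + 1)

# "Define the Area of All the Cells": intensity threshold (388 as in the tutorial)
mask = img > 388

# "Combine": grow each seed out to the mask edges via watershed,
# segmenting the individual cells
labels = segmentation.watershed(-img.astype(float), markers, mask=mask)

# "Clean Up": discard objects smaller than 250 square pixels
for region in measure.regionprops(labels):
    if region.area < 250:
        labels[labels == region.label] = 0

# "Analyze the Data": measure what remains
for region in measure.regionprops(labels):
    print(region.label, region.area, region.centroid)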
Please upload screenshots of the following into your Google Sheet:
In this training project, you will execute each step with an Example Sample and then with your own assigned sample (found in the Getting Started section).
[Background/Rationale]
name_of_sample
Manually segment brain regions from a single fish (maybe use stained sample?)
Step 1: Step 2: Step 3:
Video
Manually segment brain regions from a single fish (maybe use stained sample?)
no video
Takeaways:
In the Google Sheet, please:
X
Y
Z
Linux is a command-line-based operating system. For our purposes, Linux = Unix.
The next step of this project is performed in Linux. Linux is a powerful operating system used by many scientists and engineers to analyze data. Linux commands are written on the command line, which means that you tell the computer what you want it to do by typing, not by pointing and clicking as you do in operating systems like Windows, ChromeOS, iOS, or macOS. The commands themselves are written in a language called Bash.
Here's a brief tutorial that covers some concepts and common commands using a sample code.
Concepts
Command Line
The Linux command line is a way of interacting with the computer's operating system by typing in commands in a text-based interface. It's a bit like talking to your computer using a special language, and that language is called Bash.
The command line in the Terminal tab should look something like this:
(base) your-user#@your-AWS-IP-address:~$
(base) your-user#@your-AWS-IP-address:~$ linux-commands-are-typed here
(base) your-user#@your-AWS-IP-address:~$
Using the command line is different than using a graphical user interface (GUI) because instead of clicking on icons and buttons to interact with your computer, you type in text commands. It can take some time to learn the commands and syntax of the command line, but once you do, you can often do things faster and more efficiently than with a GUI. Some people also prefer using the command line because it gives them more control over their computer's settings and can be more powerful for certain tasks.
Let's type a command in the command line.
List the files and directories in the current directory using the ls command:
ls
You should see something like this output to the screen:
It looks like a lot, but don't fret: this is just a bunch of files and directories (i.e., folders). You'll notice the Terminal colors entries by their type; for example, directories are colored blue. A directory is computer-science speak for a folder. So this listing shows you what is in the folder you are currently sitting in.
Directory
As we have mentioned, a directory is another name for a folder in which files are stored.
If you look immediately to the left of the $, you will see what is called the "working directory". The ~ symbol has a special meaning in Linux systems: it refers to the home directory on your system.
Navigate to the CMTK_Analyses directory using the cd command.
cd ~/CMTK_Analyses/
After you execute the command, your command line should look like this:
(base) user2@ip-172-31-25-174:~/CMTK_Analyses$
Now, use the ls command to list the files and directories in this directory:
ls
You should see something like this:
Now, let's say you need to go back to the home (~) directory:
cd ~
You should once again be back in the home (~) directory. It should look like this:
(base) your-user#@your-AWS-IP-address:~$
Now, go back to the CMTK_Analyses directory.
cd CMTK_Analyses
Sometimes we need to create a new directory to store files for a project, just like when you create a new folder on your computer.
Let's create a new directory called test_directory in the current directory using the mkdir command.
mkdir test_directory
Execute ls to confirm that the test_directory directory was indeed created.
Enter the test_directory directory.
cd test_directory
Your terminal should look like this:
(base) user2@ip-172-31-25-174:~/CMTK_Analyses/test_directory$
Return to the ~/CMTK_Analyses directory:
cd ~/CMTK_Analyses
List the last 15 commands run in your terminal so you can submit them in the next checkpoint:
history 15
There's a lot more that we could cover about working with the Linux command line, but this is enough to get started with your bioinformatics analysis.
In the Google Sheet, please:
X
Y
Z
[Background/Rationale]
Run CMTK Registration
name_of_sample
Run CMTK on example sample
Video
Step 1: Step 2: Step 3:
The output should look like this:
Run CMTK on your assigned sample.
In the Google Sheet, please:
X
Y
Z
[Background/rationale]
name_of_sample
Generate volumetric data by region on Example Sample using CMTK or CobraZ (?)
Generate volumetric data by region on Assigned Sample using CMTK or CobraZ (?)
In the Google Sheet, please:
X
Y
Z
For the next set of steps, we'll use the Python Programming Language.
Python Programming Tutorial for High School Students
Part 1: Introduction to Python
Python is a high-level, interpreted, general-purpose dynamic programming language that focuses on code readability. Python's syntax lets programmers accomplish tasks in fewer lines of code than languages like Java or C++.
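For example, a complete Python program can be a single line:
# A complete Python program -- no boilerplate required
print("Hello, world!")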
Part 2: Basics of Python
2.1 Python Variables and Data Types
In Python, variables are created when you assign a value to them. Python has various data types, including numbers (integer, float, complex), string, list, tuple, and dictionary.
# Defining variables in Python
a = 10 # integer
b = 5.5 # float
c = 'Hello World' # string
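For completeness, here is what the other data types mentioned above look like (the values are arbitrary examples):
d = 2 + 3j                             # complex number
e = [1, 2, 3]                          # list (ordered, changeable)
f = (4, 5, 6)                          # tuple (ordered, unchangeable)
g = {'species': 'Astyanax mexicanus'}  # dictionary (key-value pairs)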
2.2 Python Operators
There are various operators in Python, such as arithmetic operators (+, -, *, /, %, **, //), comparison operators (==, !=, >, <, >=, <=), and logical operators (and, or, not).
# Python operators
a = 10
b = 20
print(a + b) # output: 30
print(a > b) # output: False
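The remaining arithmetic and logical operators work the same way (reusing a = 10 and b = 20 from above):
print(a % b)            # modulo (remainder): 10
print(a ** 2)           # exponentiation: 100
print(a // 3)           # floor division: 3
print(a > 5 and b > 5)  # logical and: True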
Part 3: Python Conditional Statements and Loops
3.1 If Else Statement
Python supports the usual logical conditions from mathematics. These can be used in several ways, most commonly in "if statements" and loops.
# Python if else statement
a = 10
b = 20
if a > b:
    print("a is greater than b")
elif a == b:
    print("a is equal to b")
else:
    print("a is less than b")
3.2 For Loop
A for loop is used for iterating over a sequence (that is either a list, a tuple, a dictionary, a set, or a string).
# Python for loop
fruits = ["apple", "banana", "cherry"]
for x in fruits:
    print(x)
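Because a string is also a sequence, a for loop works on it as well:
# Looping over a string, character by character
for letter in "TeO":
    print(letter)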
3.3 While Loop
With the while loop we can execute a set of statements as long as a condition is true.
# Python while loop
i = 1
while i < 6:
    print(i)
    i += 1
Part 4: Python Functions
A function is a block of code which only runs when it is called. You can pass data, known as parameters, into a function. A function can return data as a result.
# Python function
def my_function():
    print("Hello from a function")

my_function() # calling the function
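And here is a function that takes parameters and returns a result, as described above (brain_ratio is a made-up name for illustration):
# A function with parameters and a return value
def brain_ratio(region_volume, total_volume):
    return region_volume / total_volume * 100

print(brain_ratio(5.0, 50.0)) # output: 10.0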
In the Google Sheet, please:
X
Y
Z
[Background/Rationale]
Optic Tectum
Create violin plots to compare volumetric distributions of brain regions in Surface vs. SPF2 fish.
The example sample is the Optic Tectum, or TeO for short.
Open 2_brain-volume-distribution.ipynb and run these steps:
import pandas as pd
from scipy.stats import ttest_ind
import matplotlib.pyplot as plt
import seaborn as sns
# Load the new data
teo_data = pd.read_csv('./TeO.csv')
# Display the first few rows of the dataframe
teo_data.head()
# Create a new 'Category' column based on the 'File' column
teo_data['Category'] = teo_data['File'].apply(lambda x: 'Surface' if 'Surface' in x else 'SPF2')
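# (The lambda above labels each fish 'Surface' if its filename contains
# 'Surface', and 'SPF2' otherwise.)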
# Perform a two-sample Student's t-test for 'TeO'
surface_data = teo_data[teo_data['Category'] == 'Surface']['Sum']
spf2_data = teo_data[teo_data['Category'] == 'SPF2']['Sum']
t_stat, p_val = ttest_ind(surface_data, spf2_data)
print(t_stat)
print(p_val)
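# How to read these numbers: t_stat measures how far apart the two group
# means are relative to the spread of the data, and p_val is the probability
# of seeing a difference at least this large if the groups truly had the same
# mean. By convention, p_val < 0.05 is taken as a significant difference.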
# Create a mapping of p-value to significance indicator
if p_val < 0.0001:
    sig_indicator = '****'
elif p_val < 0.001:
    sig_indicator = '***'
elif p_val < 0.01:
    sig_indicator = '**'
elif p_val < 0.05:
    sig_indicator = '*'
else:
    sig_indicator = ''
# Create the violin plot (without points, p-value, title, y-axis label)
plt.figure(figsize=(10, 6))
sns.violinplot(x='Category', y='Sum', data=teo_data, inner=None, palette="pastel")
# Create the violin plot (with points)
plt.figure(figsize=(10, 6))
sns.violinplot(x='Category', y='Sum', data=teo_data, inner=None, palette="pastel")
# Overlay the datapoints
for category in ['Surface', 'SPF2']:
    category_data = teo_data[teo_data['Category'] == category]
    plt.plot([0 if category == 'Surface' else 1]*len(category_data), category_data['Sum'], 'k.', markersize=5)
# Show the plot
plt.show()
# Create the violin plot (with points, p-value, title, y-axis label)
plt.figure(figsize=(10, 6))
sns.violinplot(x='Category', y='Sum', data=teo_data, inner=None, palette="pastel")
# Overlay the datapoints
for category in ['Surface', 'SPF2']:
    category_data = teo_data[teo_data['Category'] == category]
    plt.plot([0 if category == 'Surface' else 1]*len(category_data), category_data['Sum'], 'k.', markersize=5)
# Add a horizontal line and significance indicator if the p-value is significant
if sig_indicator:
    ymax = teo_data['Sum'].max()
    plt.plot([0, 1], [ymax + 0.1*ymax, ymax + 0.1*ymax], 'k-')
    plt.text(0.5, ymax + 0.12*ymax, sig_indicator, ha='center', fontsize=14)
    plt.ylim(-0.1*ymax, ymax + 0.2*ymax)
# Set the title and labels of the plot
plt.title('TeO')
plt.ylabel('% of Total Brain Volume')
# Show the plot
plt.show()
Now recreate this analysis for your assigned sample. In the template script below:
Replace 'your_sample.csv' with the name of your data file.
Replace your_sample_data with the name of your sample. For example, if you're assigned the Tegmentum (Tg), your_sample_data would be changed to tg_data.
# Load the new data (your_sample_file is also used for the plot title below)
your_sample_file = 'your_sample.csv'  # replace with your data file's name
your_sample_data = pd.read_csv('./' + your_sample_file)
# Display the first few rows of the dataframe
print(your_sample_data.head())
# Create a new 'Category' column based on the 'File' column
your_sample_data['Category'] = your_sample_data['File'].apply(lambda x: 'Surface' if 'Surface' in x else 'SPF2')
# Perform a two-sample Student's t-test for your_sample
surface_data = your_sample_data[your_sample_data['Category'] == 'Surface']['Sum']
spf2_data = your_sample_data[your_sample_data['Category'] == 'SPF2']['Sum']
t_stat, p_val = ttest_ind(surface_data, spf2_data)
print(t_stat)
print(p_val)
# Create a mapping of p-value to significance indicator
if p_val < 0.0001:
    sig_indicator = '****'
elif p_val < 0.001:
    sig_indicator = '***'
elif p_val < 0.01:
    sig_indicator = '**'
elif p_val < 0.05:
    sig_indicator = '*'
else:
    sig_indicator = ''
# Create the violin plot (without points, p-value, title, y-axis label)
plt.figure(figsize=(10, 6))
sns.violinplot(x='Category', y='Sum', data=your_sample_data, inner=None, palette="pastel")
# Create the violin plot (with points, p-value, title, y-axis label)
plt.figure(figsize=(10, 6))
sns.violinplot(x='Category', y='Sum', data=your_sample_data, inner=None, palette="pastel")
# Overlay the datapoints
for category in ['Surface', 'SPF2']:
    category_data = your_sample_data[your_sample_data['Category'] == category]
    plt.plot([0 if category == 'Surface' else 1]*len(category_data), category_data['Sum'], 'k.', markersize=5)
# Add a horizontal line and significance indicator if the p-value is significant
if sig_indicator:
    ymax = your_sample_data['Sum'].max()
    plt.plot([0, 1], [ymax + 0.1*ymax, ymax + 0.1*ymax], 'k-')
    plt.text(0.5, ymax + 0.12*ymax, sig_indicator, ha='center', fontsize=14)
    plt.ylim(-0.1*ymax, ymax + 0.2*ymax)
# Set the title and labels of the plot
plt.title(your_sample_file.replace('.csv', '')) # Use the filename as the title, minus the .csv extension
plt.ylabel('% of Total Brain Volume')
# Show the plot
plt.show()
In the Google Sheet, please:
X
Y
Z
[Background/rationale]
Execute step with n = 5 Example Sample Dataset
# Import libraries that we'll need
import pandas as pd
import numpy as np
import scipy.cluster.hierarchy as spc
import matplotlib.pyplot as plt
from matplotlib import rcParams
import seaborn as sns
# Import the data and drop columns that we do not need. The result will be a matrix with each of the 180 neuroanatomical regions
# as columns and each row as a fish.
file = 'ExampleSample_5-F2s.xlsx'
# file = 'out_region_size_SPF2_HB_macro_v3.xlsx'
df_small_example = pd.read_excel(file)
df_small_example_final = df_small_example.drop(['Unnamed: 0', 'File', 'BrainSize', 'NucSize', 'SypSize'],axis=1)
df_small_example_final.head(5)
# Generate a heat map with correlations for all 180 regions. This is unclustered
corr_mat_df_small_example = df_small_example_final.corr().to_numpy()
corr_mat_df_small_example[np.isnan(corr_mat_df_small_example)] = 0 # casting nan values to zero correlation
rcParams['figure.figsize'] = 20,20
sns.heatmap(corr_mat_df_small_example)
plt.show()
## Perform clustering
# Cluster
pdist_small_example = spc.distance.pdist(corr_mat_df_small_example)
linkage_small_example = spc.linkage(pdist_small_example, method='complete')
idx_small_example = spc.fcluster(linkage_small_example, 0.5 * pdist_small_example.max(), 'distance')
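# fcluster cuts the clustering tree at half the largest pairwise distance,
# assigning each region a cluster id (1, 2, 3, ...) in idx_small_example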
# print(idx_small_example)
# Get cluster vector
cluster_vector_small_example=np.concatenate((np.argwhere(idx_small_example==1), np.argwhere(idx_small_example==2),np.argwhere(idx_small_example==3),np.argwhere(idx_small_example==4),np.argwhere(idx_small_example==5),np.argwhere(idx_small_example==6),
np.argwhere(idx_small_example==7), np.argwhere(idx_small_example==8), np.argwhere(idx_small_example==9), np.argwhere(idx_small_example==10), np.argwhere(idx_small_example==11), np.argwhere(idx_small_example==12)
))
# print(np.shape(cluster_vector_small))
# Restructure the correlation matrix so regions in the same cluster are adjacent
corr_mat_clustered_small_example = np.copy(corr_mat_df_small_example)
# print(np.shape(corr_mat_df_small_example))
for i in range(len(corr_mat_df_small_example)):
    for j in range(len(corr_mat_df_small_example)):
        corr_mat_clustered_small_example[i,j] = corr_mat_df_small_example[cluster_vector_small_example[i],cluster_vector_small_example[j]]
# Plot
sns.heatmap(corr_mat_clustered_small_example)
plt.show()
Execute step with n = 5 Example Sample Dataset
Execute step with n = 20 Example Sample Dataset
Execute step with n = 20 Your Sample Dataset
Execute with All Samples.
# Import the data and drop columns that we do not need. The result will be a matrix with each of the 180 neuroanatomical regions
# as columns and each row as a fish.
file = 'out_region_size_HB_macro.xlsx'
# file = 'out_region_size_SPF2_HB_macro_v3.xlsx'
df = pd.read_excel(file)
df2 = df.drop(['Unnamed: 0', 'File', 'BrainSize', 'NucSize', 'SypSize'],axis=1)
df2.head(5)
# Generate a heat map with correlations for all 180 regions. This is unclustered
corr_mat = df2.corr().to_numpy()
corr_mat[np.isnan(corr_mat)] = 0 # casting nan values to zero correlation
rcParams['figure.figsize'] = 20,20
sns.heatmap(corr_mat)
plt.show()
## Perform clustering
# Cluster
pdist = spc.distance.pdist(corr_mat)
linkage = spc.linkage(pdist, method='complete')
idx = spc.fcluster(linkage, 0.5 * pdist.max(), 'distance')
# print(idx)
# Get cluster vector
cluster_vector=np.concatenate((np.argwhere(idx==1), np.argwhere(idx==2),np.argwhere(idx==3),np.argwhere(idx==4),np.argwhere(idx==5),np.argwhere(idx==6),
np.argwhere(idx==7), np.argwhere(idx==8), np.argwhere(idx==9), np.argwhere(idx==10), np.argwhere(idx==11), np.argwhere(idx==12)
))
# print(np.shape(cluster_vector))
# Restructure correlation matrix
corr_mat_clustered=np.copy(corr_mat)
# print(np.shape(corr_mat))
for i in range(len(corr_mat)):
    for j in range(len(corr_mat)):
        corr_mat_clustered[i,j] = corr_mat[cluster_vector[i],cluster_vector[j]]
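Finally, plot the clustered matrix, just as in the example run above:
# Plot the clustered correlation matrix
sns.heatmap(corr_mat_clustered)
plt.show()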
In the Google Sheet, please:
X
Y
Z
So, at this point, you have completed the full analysis pipeline, from segmented brain images to clustered, brain-wide volumetric data. Congratulations!
Please complete the Post-VTP Survey.