Project #1 - January 2022

Group members: Amir Mohammad Saghiran Hatami - Hadis Rezaei - Shuvangi Benerjee

# install BiocManager if not exist
if (!requireNamespace("BiocManager", quietly = T))
    install.packages("BiocManager",quiet = T)  

# install edgeR if not exist
if (!requireNamespace("edgeR", quietly = T))
    BiocManager::install("edgeR", quiet = T)

# install VennDiagram if not exist
if (!requireNamespace("VennDiagram", quietly = T))
    install.packages("VennDiagram", quiet = T)

Load “edgeR” and “VennDiagram” library to use in the project.

library(edgeR, quietly = T)
library(VennDiagram, quietly = T)

Common part:

Task #1: Read the data into a DGEList object.

Solution: We read the text files that contain human gene expression profiling RNA-Seq datasets composed of 60 samples from 10 human organs/tissues and store in a data frame. We should import 2 files. – Counts.csv: A table containing gene expression values (read counts) for the 28.188 human genes in the 60 replicates – Design.csv: a table containing the experimental design of the RNAseq (i.e the tissue, individual, and sex associated with each biological replicate)

# First we read and load the data into the data frames
design=read.table("design.csv",header=T,row.names=1,sep="\t")
Importeddesign=design
Importedcounts=read.table("counts.csv",header=T,row.names=1,sep="\t", check.names=FALSE)

Two tables must be consistent in order of rows/column names.

# The row names of "design" data frames are not consistent with the column names of the "counts" data frames, so we reorder the columns of the "counts" (replicates) based on the row names of the "design" (replicates).
sortby = rownames(design)
counts = Importedcounts[,sortby]

# To check if replicates are correlated within two data frames
rownames(design) == colnames(counts)

##  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [16] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [31] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [46] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

Task #2: keep only genes that are likely to be expressed (i.e genes that have more than 10 reads in at least 1 replicate).

Solution: We filter out the genes that are not expressed, by checking if cells in the data frame have a count >=10 and how many times genes (rows) have a count of 10 or greater.

# A logical test for cells of the "counts" matrix
EXP=counts>=10

# Summation of the TRUE values to have an estimation of the replicates expressing each gene
NumExp=apply(EXP,1,sum)

# Select only the genes that have a count of 10 or greater at least once!
FiltCounts=counts[NumExp>0,]

Task #3: Perform normalization with calcNormFactors()

Solution: We should create a DGElist from “counts” data frame grouped by tissue types. Then we need to use the calcNormFactors function to normalize the data. This function adds a new column to the DGElist. This column is called norm.factors (scaling factors) that will be used to multiply with each gene’s read count to normalize library size. By default, TMM method implements the trimmed mean of M-values method.

# To use calcNormFactors() we need to a DGE object
y=DGEList(counts=FiltCounts,group=design$tissue)

# Normalization factor calculation by calcNormFactors function
y=calcNormFactors(y)

What’s the logic:

Based on the manual obtained by “?calcNormFactors”, for the downstream analysis lib.size will be multiplied by norm.factors. It means that for library one, “GTEX-Y5V6-0526-SM-4VBRV”, the reads for each gene in the library will be multiplied by 1.276983 and so on. The logic is, when norm.factors is higher than one, it means that the library has a little bit fewer reads than the others and will be increased.

# Let inspect it:
head(y$samples)

Task #4: Perform a MDS (~PCA) plot of the data

Solution: We perform Principal Component Analysis to examine if the data is clustered as we expect. MDS (Multidimensional Analysis plot) is a PCA-like technique which can be used to see how well our expression data (in particular the replicates) recapitulates the biological conditions under study. To better illustrate we show the “Tissues” in different colors. Since we have 10 tissues, we specify 10 colors. In this plot we have 2 principle components. The first component (x-axis) explains 21 percent of the data variability and the second component (y-axis) explains 15 percent of the data variability.

# PCA plot with plotMDS function by defining of the colors, values, and lables, respectively.
plotMDS(y,col=rep(c("red","green","blue","yellow","black","purple","pink","orange","gray","brown"),table(design$tissue)), labels = design$tissue, cex = 0.5, main="Principal Component Analysis of 60 replicates from 10 tissues")

Task #5: Select 2 different (and meaningful) biological conditions and perform a differential expression analysis using the exactTest() function

Solution: We should use esimateDisp to compute the Negative Binomial on the data. esimateDisp for every gene estimates its variability and average expression to compute the p-values. The more data (replicates or reads) leads to a more accurate model. This function adds common.dispersion, trended.dispersion, and tagwise.dispersion classes into the DGElist. We apply this function on the DGElist class with Tagwise model (as the default setting) to estimate variability for all the genes, gene by gene. Then we can do a differential expression test with exacTest output. The parameter pair is used to provide the pair of conditions that we are going to compare. This function adds DGEExact class in the DGElist. We select the Muscle-Skeletal and Pancreas tissue due to their nice separation in the PCA plot.

# Negative Binomial calculation of the data
y=estimateDisp(y)

## Using classic mode.

# Variability estimation of DGElist class data for all the genes using tagwise model as the default setting
Muscle_Pancreas=exactTest(y,pair =c("Muscle - Skeletal","Pancreas"))

If we inspect the table class of the DEGlist, we will see the logFC, logCPM, and PValue columns. The lower the p-value is, the more likely the gene is differentially expressed

head(Muscle_Pancreas$table)

Task #6: Create a topTags type of edgeR object containing the list of differentially expressed genes (DEGs) [DEGs should have a FDR <= 0.01].

Solution: We use topTags to extract the Differentially Expressed Genes (DEG). topTags extracts the most DEG (or sequence tags) from a test object, ranked either by p-value or by absolute log-fold-change. topTags function applies Benjamini Hochberg by default. We set n to the total number of genes (rows) within the “Muscle_Pancreas” object. topTags adds a column named FDR which stores relative FDR value for each gene.

LogFC = LogFoldChange of the expression between two condition

LogCPM = Log Count per Million that is the average expression

P-Value = P-Value of differentially expressed genes between two conditions (meaning that the expression was not the same for that gene)

FDR = FalseDiscoveryRate, the parameter that we use to say the if the difference is significant (<= 0.01)

Muscle_PancreasDE=topTags(Muscle_Pancreas,n=nrow(Muscle_Pancreas),p.value = 0.01)
dim(Muscle_PancreasDE)

## [1] 10644     4

head(Muscle_PancreasDE$table)

Task #7: Assign all the genes into one of the 4 possible classes:

- DE_UP (FDR<=0.01 and logFC>0),

- DE_DOWN (FDR<=0.01 and logFC<0),

- notDE_UP (FDR>0.01 and logFC>0),

- notDE_DOWN (FDR>0.01 and logFC<0),

Solution: We need to get the complete (all the genes) results from our topTags object. This can be done by rerunning the topTags function without the P-Value filtration

Muscle_PancreasTop=topTags(Muscle_Pancreas,n=nrow(Muscle_Pancreas))

# We can see the number of genes are increased compared with "Muscle_PancreasDE"
dim(Muscle_PancreasTop)

## [1] 20395     4

# We categorize the genes into four classes defined by the criteria in the question based on the (FDR & logFC) columns in the $table object. 
dataMuscle_Pancreas=Muscle_PancreasTop$table
UPde=dataMuscle_Pancreas[dataMuscle_Pancreas$FDR<=0.01 & dataMuscle_Pancreas$logFC>0,"logFC"]
UPnoDE=dataMuscle_Pancreas[dataMuscle_Pancreas$FDR>0.01 & dataMuscle_Pancreas$logFC>0,"logFC"]
DOWNde=dataMuscle_Pancreas[dataMuscle_Pancreas$FDR<=0.01 & dataMuscle_Pancreas$logFC<0,"logFC"]
DOWNnoDE=dataMuscle_Pancreas[dataMuscle_Pancreas$FDR>0.01 & dataMuscle_Pancreas$logFC<0,"logFC"]

sprintf("Up Regulated [Differentially Expressed] Genes = %i", length(UPde))

## [1] "Up Regulated [Differentially Expressed] Genes = 5770"

sprintf("Up Regulated [Not Differentially Expressed] Genes = %i", length(UPnoDE))

## [1] "Up Regulated [Not Differentially Expressed] Genes = 5340"

sprintf("Down Regulated [Differentially Expressed] Genes = %i", length(DOWNde))

## [1] "Down Regulated [Differentially Expressed] Genes = 4874"

sprintf("Down Regulated [Not Differentially Expressed] Genes = %i", length(DOWNnoDE))

## [1] "Down Regulated [Not Differentially Expressed] Genes = 4145"

- Then, do a boxplot of the logFC of the genes belonging to each class

Solution: Boxplot is probably the most commonly used chart type to compare distribution of several groups. Box plots divide the data into sections that each contain approximately 25% of the data in that set. Box plots are useful as they provide a visual summary of the data enabling researchers to quickly identify mean values, the dispersion of the data set, and signs of skewness.

boxplot(UPde,UPnoDE,DOWNde,DOWNnoDE,outline=FALSE,col=rainbow(4),
        ylab="logFC score", names=c("DE_UP","notDE_UP","DE_DOWN","notDE_DOWN"), 
        main="Comparison of logFC distributions", xlab="Four possible classes of genes")

Project 1:

Task #1: Identify genes showing sex specific expression in the 2 tissues that you considered for the “common part”.

- Perform a PCA (principal component analysis) to ascertain whether there is separation between biological replicates of different sexes (for every tissue you have 3 individuals of sex 1 and 3 of sex 2)

Solution: The same as the common part, we need to filter out the not expressed genes, correlate two data frames, and normalize the read counts for these tissues (Pancreas and Muscle-Skeletal). Then we can plot a PCA for them.

# Filter selected tissues, add a new column with the values of two columns to make the plot more informative (gender-tissue pairs are separated), then sort by the new column.
design=Importeddesign[Importeddesign$tissue == "Muscle - Skeletal" | Importeddesign$tissue == "Pancreas", ]
design$Mixed = paste(design$sex,design$organ)
design=design[ order(design$Mixed), ]

# Store the row names of the design table to correlate design table with count table
replicatestoKeep = rownames(design)
counts = subset(Importedcounts, TRUE, replicatestoKeep)
rownames(design) == colnames(counts)

##  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

# Filter out low expressed genes
EXP = counts>=10

# how many times genes (rows) have a count of 10 or greater?
NumExp = apply(EXP,1,sum)

# select only the genes that a count of 10 or greater at least once!
FiltCounts = counts[NumExp>0,]

# Perform a PCA to assure whether there is separation between biological replicates of different sexes
y = DGEList(counts=FiltCounts,group=design$Mixed)
y = calcNormFactors(y)

plotMDS(y,col=rep(c("red","green","blue","black"),table(design$Mixed)), labels = design$Mixed, cex = 0.5, main="PCA of 12 biological replicates of different sexes")

We can see clearly that replicates of each tissue are separated by the gender of the individual.

− For each tissue, consider only genes expressed (>10 reads in at least 2 replicates) in that tissue.

− Use edgeR (exactTest) to perform a differential expression analysis

− Consider all the genes that show a FDR <= 0.05 as “sex specific” DEGs

Solution: We need to create a different data frame for each tissue containing all the expressed genes in that tissue. For this goal we defined a function (Xfunc) to prevent the code clutter.

# List the desire tissues in a vector
Tissues = c("Muscle - Skeletal","Pancreas")

# define a function to prevent write it more than one time
Xfunc <- function(TISSUE) {
  # We need to separate tissues into different data frames to count and check genes
  Xdesign=design[design$tissue==TISSUE , ]
  replicatestoKeep=rownames(Xdesign)
  Xcounts=subset(counts, TRUE, replicatestoKeep)
  # For each tissue, consider only genes expressed (>10 reads in at least 2 replicates) in that tissue
  XEXP=Xcounts>=10
  # how many times genes (rows) have a count of 10 or greater?
  XNumExp=apply(XEXP,1,sum)
  # select only the genes that a count of 10 or greater at least twice in that tissue!
  XfilteredData=Xcounts[XNumExp>=2,]
  # Creating the DGE object
  y=DGEList(counts=XfilteredData,group=Xdesign$sex)
  # Library Size Normalization
  y=calcNormFactors(y)
  # Differential expression analysis
  y=estimateDisp(y)
  XGenders=exactTest(y,pair =c("1","2"))
  # Sex specific DEGs: Genes differentially expressed between two genders by FDR <= 0.05
  XDE=topTags(XGenders,n=nrow(XGenders),p.value = 0.05)
  return(XDE)
}

# Now we can utilize the "XFunc" function to get the list of DEGs showing sex specific expression in each tissue, separately.
MuscleDE = Xfunc(Tissues[1])

## Using classic mode.

PancreasDE = Xfunc(Tissues[2])

## Using classic mode.

Task #2: Draw a Venn Diagram of the Sex specific genes between the 2 tissues

Solution: For this goal, we need to create a list by putting together two lists of DEGs to plot them in the same Venn diagram.

# Create list of differential expressed genes
listDE=list(rownames(MuscleDE),rownames(PancreasDE))

# Plot using the venn.diagram function and save the plot into a file. Use the filename 
# parameter to set a sensible name
venn.diagram(listDE,category.names = c("Muscle - Skeletal","Pancreas"),filename = "Venn_Diagram_Sex_Specific_Genes.png",imagetype = "png",main="Venn Diagram of Sex Specific DE genes in Skeletal Muscle and Pancreas",col=c("red","blue"), margin=0.05, ext.text=FALSE)

## [1] 1

knitr::include_graphics("Venn_Diagram_Sex_Specific_Genes.png")

Venn Diagram of DE genes in Skeletal Muscle and Pancreas

- How many genes are sex specific in both tissues?

369+29+16 = 414 genes are sex-specific in at least one tissue. 16 genes are sex-specific in both tissues.

Task #3: Finally draw a Venn Diagram between the DEGs (as identified in the common part) and genes showing Sex-specific expression in at least one of the tissues considered.

Solution: We could either combine Sex-specific genes expressed in at least one of two tissues into one list and perform the venn diagram with the DEGs from common part, or each tissue separately to have a more detailed plot.

# Create a list of the three gene sets 
listDE=list(rownames(MuscleDE), rownames(PancreasDE), rownames(Muscle_PancreasDE))
plt=venn.diagram(listDE,category.names = c("Muscles Sex Specific", "Pancreas Sex Specific", "Muscles Pancreas DEGs"),filename = "DEGs_vs_Sex_specific.png",imagetype = "png",main="DE genes in Muscle and Pancreas sex specific (FDR<=0.05) and DEGs (FDR<=0.01)",col=c("red","blue","green"), margin=0.05)

knitr::include_graphics("DEGs_vs_Sex_specific.png")

Venn Diagram of DE genes in Muscle and Pancreas sex specific and DEGs

In Muscle (red circle), 116 genes are only differentially expressed between two sexes (sex-specific, not tissue-specific) while 15 genes are sex-specific in both Muscle and Pancreas tissues (not tissue-specific).

254 genes are both sex-specific in Muscles and differentially expressed between two tissues. But, as the names of the 253 genes out of these 254 genes are removed during the filtering of not expressed genes in the Pancreas, they are only shared between the green and red circles (not with blue). The same logic applies for 20 genes of Pancreas.

On the other hand, 10370 genes are differentially expressed due to any condition variation between samples but not related to gender (such as age or epigenomic state). It should be noted that during these two analyses (Common part and Project 1) we used different FDR thresholds that results in the change of the number of genes in the Venn Diagram.

- How many genes that are DE between the 2 tissues are also DE between the 2 sexes?

253+1+20=274

- Do you expect to see many? Why?

No, because many other factors besides gender contribute to the change in the expression level of a gene.

Additional objective: What is this single gene?

# There is a function called "intersect" that allows to find the common items between list 
intersect(intersect(rownames(MuscleDE), rownames(PancreasDE)),rownames(Muscle_PancreasDE))

## [1] "ENSG00000055955"

About the one gene shared between three circles, this is the gene that is expressed in both tissues but with significantly different expression levels. This gene is also sex-specific. The “ENSG00000055955” gene is named “inter-alpha-trypsin inhibitor heavy chain 4 (ITIH4)”, which is mainly expressed (exclusive expression) in the liver and in much lower extent in Pancreas but still higher than Muscles. The FANTOM5 data set confirms that this gene is expressed in both Pancreas and Muscles with a higher level than the threshold but not with the same expression levels. [ref:https://www.proteinatlas.org/ENSG00000055955-ITIH4/tissue]