signallingsinglecell.Rmd

---
title: "SignallingSingleCell"
geometry: margin=0.5cm
output:
  html_document:
    toc: true
    toc_float : yes
    toc_depth : 4
---

```{r setup, include=FALSE, warning=FALSE, error=FALSE}
knitr::opts_chunk$set(echo = TRUE, comment = "")
```

SignallingSingleCell is an R package that provides an end-to-end approach for the analysis of scRNA-seq data, with a particular focus on building ligand and receptor signaling networks.

<img align="center" src="./images/ssc_workflow.png">

# SignallingSingleCell Package 

The workflow of SignallingSingleCell consists of a few key steps; 

* filtering
* gene selection
* dimension reduction
* clustering
* differential expression
* network analysis.

Additionally, the package has a variety of features that were built in an ad hoc manner. These features include algorithms for the detection of marker genes that are specific to each cell type, aggregation functions used to construct in silico bulk data, and a supervised method of scRNA-seq analysis in order to contextualize the scRNA-seq data with commonly used immunological markers. 

More information on the SignallingSingleCell package and example vignette can be found [here](https://github.com/garber-lab/SignallingSingleCell/tree/master/vignettes).

# Human Vitiligo Data

We performed scRNA data analysis using SignallingSingleCell on affected and unaffected skin from vitiligo patients, as well as healthy controls, to define the role of each cell type in coordinating autoimmunity during disease progression. 

## Data Description

The data was generated using suction blistering followed by scRNA-seq on 10 subjects with active vitiligo (treatment-naïve for at least 6 months) and 7 healthy individuals. For each vitiligo subject, we collected 2 sets of blisters, one lesional sample from affected skin and one non-lesional sample from unaffected skin, located at least 10cm away from any visible lesion. 

You can find more details in, 

https://www.science.org/doi/10.1126/scitranslmed.abd8995

For convenience we are providing 3 processed data tables here. One is a raw sparse matrix, that is the output of our processing pipeline with no further manipulation. We are also providing a fully processed UMI table that has been filtered, dimension reduced, clustered, and normalized. The processed UMI table is provided in the expression set class format  (https://www.bioconductor.org/packages/devel/bioc/vignettes/Biobase/inst/doc/ExpressionSetIntroduction.pdf). We are also providing a separate processed UMI table for the T Cells.

The raw .fastq files have been deposited through dbGAP.

https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs002455.v1.p1

These files are protected, and you will need an account and to request access to download the files. As of 10/5/21 the data files are processing on the dbGAP server.

### Download Links

```{r error=FALSE, message=FALSE, warning=FALSE, cache=FALSE, include=FALSE}
# devtools::install_github("garber-lab/SignallingSingleCell")
library(SignallingSingleCell)
library(kernlab)
# load(url("https://www.dropbox.com/s/y967oe6vwk1jue2/phs002455.v1.p1_UMITable.Rdata?dl=1")) # The Raw Data
load(url("https://www.dropbox.com/s/s49jt231j9gib5u/phs002455_processed_UMItable.Rdata?dl=1"))
load(url("https://www.dropbox.com/s/clcd7oref1yil5x/phs002455_TCells_processed_UMItable.Rdata?dl=1"))
```

```{r eval=FALSE, echo=TRUE}
# devtools::install_github("garber-lab/SignallingSingleCell")
library(SignallingSingleCell)
library(kernlab)
# load(url("https://vitiligo.dolphinnext.com/data/phs002455.v1.p1_UMITable.Rdata")) # The Raw Data
load(url("https://vitiligo.dolphinnext.com/data/phs002455_processed_UMItable.Rdata"))
load(url("https://vitiligo.dolphinnext.com/data/phs002455_TCells_processed_UMItable.Rdata"))
```

## Data Exploration  

### The ExpressionSet Class  

The ExpressionSet class (ex_sc) is a convenient data structure that contains 3 dataframes. These dataframes contain expression data, cell information, and gene information respectively. 

exprs(ex_sc) is the expression data, where rows are genes and columns are cells. 

pData(ex_sc) is cell information, where rows are cells and columns are metadata.

fData(ex_sc) is gene information, where rows are genes and columns are metadata.

```{r, include=TRUE, cache=FALSE, warning=FALSE, error=FALSE}
colnames(pData(phs002455_processed_UMItable))
# Columns 1-4 are all phenotypic metadata related to the patient sample
# Columns 5-12 are all related to cell specific metadata such as tSNE coordinates, Clusters, etc
colnames(fData(phs002455_processed_UMItable))
# The only thing stored in fData right now are cluster marker scores.
# These were calculated with id_markers()
```

### Figure 1B

Reproduce Figure 1B from https://www.science.org/doi/10.1126/scitranslmed.abd8995

```{r, include=TRUE, cache=FALSE, warning=FALSE, error=FALSE, fig.height = 8, fig.width = 10}
plot_tsne_metadata(phs002455_processed_UMItable, color_by = "Cluster_Fig_1B", shuffle = T)
plot_tsne_gene(phs002455_processed_UMItable, gene = c("TRAC", "TYR"))
# Additional plots
plot_violin(phs002455_processed_UMItable, gene = c("IFNG"), color_by = "Skin", facet_by = "Cluster_Refined")
markers <- return_markers(phs002455_processed_UMItable, return_by = "Cluster_Refined", num_markers = 5)
markers
plot_gene_dots(phs002455_processed_UMItable, genes = unique(unlist(markers)), break_by = "Cluster_Refined")
```

### Figure 1C

Reproduce Figure 1C from https://www.science.org/doi/10.1126/scitranslmed.abd8995

```{r, include=TRUE, cache=FALSE, warning=FALSE, error=FALSE, fig.height = 8, fig.width = 10}
plot_tsne_metadata(phs002455_TCells_processed_UMItable, color_by = "Cluster", shuffle = T)
plot_tsne_gene(phs002455_TCells_processed_UMItable, gene = c("TRAC", "CD4", "CD8A", "FOXP3", "TRGC1", "FCER1G"))
# To create the aggregate bulk heatmaps first calculate the aggregate bulk values
phs002455_TCells_processed_UMItable <- calc_agg_bulk(phs002455_TCells_processed_UMItable, aggregate_by = "Cluster")
# Then identify markers
phs002455_TCells_processed_UMItable <- id_markers(phs002455_TCells_processed_UMItable, id_by = "Cluster", overwrite = T)
markers <- return_markers(phs002455_TCells_processed_UMItable)
markers
markers <- unique(unlist(markers))
```

Now create a heatmap using these values

```{r, include=TRUE, cache=FALSE, warning=FALSE, error=FALSE, fig.height = 12, fig.width = 6}
plot_heatmap(phs002455_TCells_processed_UMItable, type = "bulk", genes = markers)
```

## Basic Network Analysis

### Annotate Ligands and Receptors

These functions are still in development, but we have included them for illustration purposes. The goal is to take aggregate bulk CPM values derived from scRNA-seq, cross reference them to a database of ligand and receptor pairs (Ramilowski, Jordan A et al. “A draft network of ligand-receptor-mediated multicellular signalling in human.” Nature communications vol. 6 7866. 22 Jul. 2015, doi:10.1038/ncomms8866), and then construct a network. The first step is to calculate the aggregate bulk CPM values, followed by annotating the ligands and receptors in the data.


```{r, include=TRUE, cache=FALSE, warning=FALSE, error=FALSE}
# This function is calculating our aggregate bulk CPM values. We have some thresholds (25 CPM or the value will be set to 0, and 15 minimum cells expressing within the group or the value set to 0). This is to greatly reduce the complexity of the network, and reduce spurious nodes with little supporting data.
phs002455_processed_UMItable <- calc_agg_bulk(phs002455_processed_UMItable, aggregate_by = c("Skin", "Cluster_Refined"), group_by = "Skin", cutoff_cpm = 25, cutoff_num = 15)
# The aggregate bulk data is written into fData()
colnames(fData(phs002455_processed_UMItable))[14:ncol(fData(phs002455_processed_UMItable))]
# This function will now cross reference the ligand and receptor database, and then annotate this information into fData
phs002455_processed_UMItable <- id_rl(phs002455_processed_UMItable)
# The ligand and receptor annotations are written into fData(). The IFNG entries are highlighted below.
fData(phs002455_processed_UMItable)[c("IFNG", "IFNGR1", "IFNGR2"),c(53:56)]
```

#### Plot Ligands

```{r, include=TRUE, cache=FALSE, warning=FALSE, error=FALSE, fig.height = 12, fig.width = 6}
vals <- which(fData(phs002455_processed_UMItable)[,"networks_ligands"])
ligs <- rownames(fData(phs002455_processed_UMItable))[vals]
length(ligs)
plot_heatmap(phs002455_processed_UMItable, genes = ligs, type = "bulk", cluster_by = "row",facet_by = "Cluster_Refined", pdf_format = "tile", scale_by = "row", cluster_type = "kmeans", k = 15)
```


#### Plot Receptors

```{r, include=TRUE, cache=FALSE, warning=FALSE, error=FALSE, fig.height = 12, fig.width = 6}
vals <- which(fData(phs002455_processed_UMItable)[,"networks_Receptors"])
recs <- rownames(fData(phs002455_processed_UMItable))[vals]
length(recs)
plot_heatmap(phs002455_processed_UMItable, genes = recs, type = "bulk", cluster_by = "row",facet_by = "Cluster_Refined", pdf_format = "tile", scale_by = "row", cluster_type = "kmeans", k = 15)
```

### Build ligand and receptor table

Now we can calculate the long format network table. Please note this step is slow and poorly optimized... it needs to be rewritten. On my laptop it takes about 30 minutes to run.

Please keep in mind the network scales exponentially with the number of groups you provide. For each individual ligand or receptor, a node is created when the cell type expresses it, and an edge between all of its expressed pairs. This means that if 5 cells express ligand A which binds to receptor B, there will be 5 ligand A nodes, and an edge from each of them to receptor B node. The more promiscuous a given ligand and receptor interaction, and the more cell types in the data, this network quickly explodes to become unmanageable in size.

We have found 10-15 cell types is the upper limit of what can be done on a local computer without needing a cluster environment.

For convenience a download link to the output file is provided below, however you may opt to run the command on your own.

```{r, include=TRUE, cache=FALSE, warning=FALSE, error=FALSE}
### Warning Slow!!! Use provided output table for speed. The command is left commented for speed reasons ### 
# network_table <- calc_rl_connections(phs002455_processed_UMItable, nodes = "Cluster_Refined", group_by = "Skin", print_progress = T)
load(url("https://www.dropbox.com/s/p30mwg5pcj58cc6/network_table.Rdata?dl=1"))
str(network_table)
# The lists contains 3 entries. 
# Summary provides a high level view of the number of connections between cell types.
# full_network is the same as full_network_raw, except that full_network_raw will preserve entries where either the ligand or receptor is 0 in expression. 
head(network_table$full_network)
```

Each row is defined as a cell type expressing a ligand, a cell type expressing the receptor, as well as the condition (Skin, ie Healthy, Non-lesional, Lesional), the expression values of the ligand and receptor, as well as the log10_Connection_product, which is a log10 transformed product of the ligand and expression CPM values. 

There are also some basic statistics for these connections (z scores, etc), however these are not used in the manuscript and are beyond the scope of this vignette.

### Build igraph object

Now we can take this long format table, and convert it into an igraph object. There are several ways this could be done...

One way would be to create 3 separate networks. One for healthy, one for non-lesional, one for lesional. However this has caveats. Because a ligand or receptor may go from OFF to ON, this means the landscape (the nodes and edges of the network) would change between conditions. To avoid this complexity, we opted to create a network that is a superset of all available networks (merge_all = T argument). In this way all possible nodes and edges are represented. Later on, we then query the original network_table in order to determine the edge values associate with each condition.

Please note that these functions write out files by DEFAULT. These files include plots and data. Set your working directory accordingly to track these files location. The prefix argument determines the prefix for these written files. Here we use the log10_Connection_product as the edge values for this network.


```{r, include=TRUE, cache=FALSE, warning=FALSE, error=FALSE}
network_igraph <- build_rl_network(input = network_table, merge_all = T, value = "log10_Connection_product", prefix = "scitranslmed.abd8995_network")
names(network_igraph)
```

The igraph_network is the full network

The layout is a 2D matrix with node positions

Clusters shows the cluster membership of each node

clusters_subgraphs contains the subgraphs for each cluster

interactive is the web browser based display. Keep in mind this renders in browser and can take a long time to fully render for the full network.

### Analyze igraph object

The network contains both a main connected body, as well as some orphan ligands and receptors with no annotated edge to connect them to the main network. For this reason, further clustering is only performed on the main graph (subset = 1, subset_on = "connected" arguments)

```{r, include=TRUE, cache=FALSE, warning=FALSE, error=FALSE}
network_igraph_C1_analyzed <- analyze_rl_network(network_igraph, subset = 1, subset_on = "connected", prefix = "scitranslmed.abd8995_network_C1", cluster_type = "louvain", merge_singles = F)
names(network_igraph_C1_analyzed)
```

The output contains statistics related to the clustering results, as well as individual cluster results and the interactive network.

The html files are a great way to view the network. Please note that they are rendered in the browser, so the full network can take a few minutes to display properly. There are drop down menus that allow you to select a particular node of interest, as well as a particular community (cluster) of the network.

### Plot network

```{r, include=TRUE, cache=FALSE, warning=FALSE, error=FALSE}
input_cell_net <- network_igraph_C1_analyzed$igraph_Network
V(input_cell_net)$membership <- network_igraph_C1_analyzed$Clusters_Results$membership
clusters <- network_igraph_C1_analyzed$Clusters_Results
weights_clusters <- ifelse(crossing(clusters, input_cell_net), 1, 10)
set.seed(10)
cluster_weighted_layout <- layout_with_fr(input_cell_net, weights = weights_clusters)
edge_colors <- ifelse(crossing(clusters, input_cell_net), "red", "black")
members <- V(input_cell_net)$membership
colopal <- RColorBrewer::brewer.pal(n = 8, name = "Dark2")
f <- colorRampPalette(colopal)
colopal <- f(length(unique(members)))
colopal <- colopal[as.factor(members)]
plot(input_cell_net,
     layout = cluster_weighted_layout,
     vertex.color = colopal,
     vertex.size = 1.5,
     vertex.label = "",
     vertex.label.cex = 1,
     vertex.label.color = "black",
     vertex.frame.color=colopal,
     edge.width = 0.1,
     edge.arrow.size = 0.01,
     edge.arrow.width = 0.1,
     edge.color = "gray50")
```