This document executes all the analysis presented in chromswitch: A flexible method for detecting chromatin state switches, from downloading the data, to running experiments, to generating the tables and figures included in the paper. The dropdown in the top right corner of this HTML doument controls whether code is shown or hidden. The document can be navigated using the menu on the left.
About the project:
In this paper, we present an R/Bioconductor package, chromswitch
, which implements a flexible method to detect chromatin state switches between samples in two biological conditions in a specific genomic region of interest given peaks or chromatin state calls from ChIP-seq data. chromswitch
is available from Bioconductor 3.6 at https://bioconductor.org/packages/chromswitch. We analyze its performance on a benchmark dataset, study its robustness to changes in tuning parameters, sample size and imbalance, window size, and varying mark profiles; and perform a genome-wide validation.
How this repository is set up:
analysis.Rmd
, and executes all the analysisinput
contains a few individual data and metadata files needed for the analysisdata
stores data which is downloaded during the analysisfigures
contains all figures produced in the analysis; the subfolder heatmaps
stores the figures output by the toolfunctions
stores a few long functions used (source()
’d by) the analysiscomputations
stores the results of computationally-intensive steps of the analysis, which are saved in the main document after being executed once, and then loaded for further analysistables
stores the data behind each figure as TSV files, and tables presented in the supplementary materialHow to re-run the analysis:
To run the analysis, this R Markdown file analysis.Rmd
needs to be knit. This involves executing every chunk, or piece of code, sequentially. To feasibly handle chunks with computationally-intensive code, we typically evaluate the chunk once, and then set eval = FALSE
for that chunk. The results are saved as an .RData
file in the computations
folder, and then loaded in a following chunk so that the results can be processed and visualized. To re-run the analysis, including these computationally-intensive steps, that option needs to be changed back to eval = TRUE
for every chunk.
Since the analysis would take multiple days to run, we recommend running the analysis from the command line, using the run.sh
bash script. If HPC resources are available, much of the work can be parallelized by modifying that script and setting the n_cores
variable in the next section accordingly.
Dependencies:
We used R 3.3.1 and Bioconductor 3.5. The R packages required are loaded below, see the exact versions in Session Info. All the data needed for the analysis is provided in the repository or downloaded in this document.
Load general packages:
# Visualization
library(RColorBrewer)
library(ggplot2)
# Install and download custom ggplot2 theme
devtools::install_github("sjessa/ggmin")
library(ggmin)
# Genomics
library(GenomicRanges)
library(rtracklayer)
library(Homo.sapiens)
library(biomaRt)
# General-purpose
library(pROC)
library(xtable)
library(BiocParallel)
library(parallel)
library(magrittr)
library(plyr)
library(purrr)
library(tidyverse)
library(dplyr)
library(stringi)
# Source functions
source("functions/plotting_functions.R")
source("functions/subsampling_functions.R")
source("functions/misc_functions.R")
Install chromswitch
from the GitHub repository:
devtools::install_github("sjessa/chromswitch@v0.99.0")
library(chromswitch)
Note on chromswitch version: chromswitch
is now available from Bioconductor 3.6, but for reproducibility, the above command will download v0.99.0 of chromswitch
, with which the bulk of the analysis was done: https://github.com/sjessa/chromswitch/releases/tag/v0.99.0.
Set-up for parallel computations:
# For other parallel computations
n_cores <- 1
chromswitch
is implemented in a relatively modular way, with a function for each step of the algorithm. It also provides two wrapper functions, one for each feature construction strategy, which runs the whole analysis. Although the wrapper functions in chromswitch can analyze multiple query regions in parallel, we will parallelize outside the function in order to catch any errors that might arise.
safeCallSummary <- purrr::safely(chromswitch::callSummary)
safeCallBinary <- purrr::safely(chromswitch::callBinary)
The data we analyze comes from the Roadmap Epigenomics Program. We evaluate three types of input:
We will work with 23 adult tissue samples from Roadmap. 7 are brain tissues, and 16 are various other tissues.
# Read in the samples
meta <- read_tsv("input/sample_metadata.tsv")
meta
table(meta$Condition)
##
## Brain Other
## 7 16
# Download the file if it doesn't exist
noClobberDownload <- function(url, destpath) {
if(!(file.exists(destpath))) download.file(url, destpath)
}
We retrieve the data from this link: http://egg2.wustl.edu/roadmap/data/byFileType/.
urls_k4 <- paste0("http://egg2.wustl.edu/roadmap/data/byFileType/peaks/consolidated/narrowPeak/",
meta$Sample, "-H3K4me3.narrowPeak.gz")
basenames_k4 <- lapply(urls_k4, basename) %>% unlist()
destpaths_k4 <- paste0("data/H3K4me3_peaks/", basenames_k4)
dl_k4 <- mcMap(noClobberDownload, urls_k4, destpaths_k4, mc.cores = n_cores)
extra_cols <- c("name" = "character", "score" = "integer", "strand" = "character",
"signalValue" = "numeric", "pValue" = "numeric", "qValue" = "numeric",
"peak" = "numeric")
pk_k4me3 <- lapply(destpaths_k4, rtracklayer::import, format = "bed",
extraCols = extra_cols)
names(pk_k4me3) <- meta$Sample
save(pk_k4me3, file = "computations/pk_k4me3.RData")
urls_dnase <- paste0("http://egg2.wustl.edu/roadmap/data/byFileType/peaks/consolidatedImputed/narrowPeak/",
meta$Sample,
"-DNase.imputed.narrowPeak.bed.nPk.gz")
basenames_dnase <- lapply(urls_dnase, basename) %>% unlist()
destpaths_dnase <- paste0("data/DNase_peaks/", basenames_dnase)
dl_dnase <- mcMap(noClobberDownload, urls_dnase, destpaths_dnase,
mc.cores = n_cores)
pk_dnase <- lapply(destpaths_dnase, rtracklayer::import, format = "bed")
names(pk_dnase) <- meta$Sample
save(pk_dnase, file = "computations/pk_dnase.RData")
To convert the chromatin state segmentations into a format that can be used by chromswitch
, we will filter to regions assigned “Active_TSS”, the state associated with active transcription.
urls_st <- paste0("http://egg2.wustl.edu/roadmap/data/byFileType/chromhmmSegmentations/ChmmModels/coreMarks/jointModel/final/",
meta$Sample, "_15_coreMarks_mnemonics.bed.gz")
basenames_st <- lapply(urls_st, basename) %>% unlist()
destpaths_st <- paste0("data/ChromHMM_segmentations/", basenames_st)
dl_st <- mcMap(noClobberDownload, urls_st, destpaths_st, mc.cores = 6)
extra_cols <- c("State" = "character")
chmm_states <- lapply(destpaths_st, rtracklayer::import, format = "bed",
extraCols = extra_cols) %>%
lapply(as.data.frame) %>%
# We will treat regions assigned the state associated with active
# transcription as the epigenomic features in our analysis
lapply(filter, State == "1_TssA") %>%
lapply(makeGRangesFromDataFrame)
names(chmm_states) <- meta$Sample
save(chmm_states, file = "computations/chmm_states.RData")
load("computations/pk_k4me3.RData")
load("computations/pk_dnase.RData")
load("computations/chmm_states.RData")
chromswitch
performanceWe manually assembled a list of 120 genes on which to benchmark chromswitch
performance. 60 of these genes exhibit chromatin state switches between brain and other tissues based on change in H3K4me3 signal (whose enrichment at promoters is associated with active transcription), DNase I hypersensitivity (associated with open chromatin), and expression (the functional consequence of chromatin changes). The other 60 are negative examples, where there was no change across the tissues in H3K4me3, DNase I, or expression data. These genes are specified in the file input/benchmark.tsv
.
To define windows for these genes to use as input for our experiment, we retrieve a list of genes and coordinates in the human genome from UCSC. This file is saved in the data
directory, and we obtained it using the UCSC Table Browser, with the following selections:
group
: Genes and Gene Predictionstrack
: RefSeqname
, chrom
, strand
, txStart
, txEnd
, and name2
all_genes <- read_tsv("data/refseq_genes.hg19.tsv",
col_names = c("transcript_id", "chromosome", "strand",
"txStart", "txEnd", "gene_name"), skip = 1)
head(all_genes)
Define 5kbp windows around the TSS:
benchmark <- read_tsv("input/benchmark.tsv") %>%
left_join(all_genes, by = c("transcript_id", "gene_name")) %>%
# Define window around the TSS
mutate(start = case_when(
strand == "+" ~ as.numeric(txStart) - 2500,
strand == "-" ~ as.numeric(txEnd) - 2500)) %>%
mutate(end = case_when(
strand == "+" ~ as.numeric(txStart) + 2500,
strand == "-" ~ as.numeric(txEnd) + 2500)) %T>%
write_tsv("input/benchmark_coords.5kbp.tsv") %>%
makeGRangesFromDataFrame(keep.extra.columns = TRUE) %>%
keepStandardChromosomes()
Table S1
benchmark %>%
as.data.frame %>%
dplyr::select(gene_name, transcript_id, chr = seqnames, start, end, switch) %>%
arrange(desc(switch), gene_name) %T>%
write_tsv("tables/tableS2.tsv") %>%
xtable()
In order to later validate chromswitch
genome-wide, we also clean up the gene list to define 5kbp windows genome-wide.
refseq_unique <- all_genes %>%
# Keep the longest transcript per gene
mutate(length = txEnd - txStart) %>%
group_by(gene_name) %>%
dplyr::slice(which.max(length)) %>%
ungroup() %>%
# Filter out genes where the length of the gene is less than half the
# window size
filter(length >= 2500)
getRefSeq <- function(window) {
d <- window / 2
refseq_unique %>%
# Define window around the TSS
mutate(start = case_when(
strand == "+" ~ as.numeric(txStart) - d,
strand == "-" ~ as.numeric(txEnd) - d)) %>%
mutate(end = case_when(
strand == "+" ~ as.numeric(txStart) + d,
strand == "-" ~ as.numeric(txEnd) + d)) %T>%
write_tsv(paste0("input/benchmark_coords.", window, ".tsv")) %>%
makeGRangesFromDataFrame(keep.extra.columns = TRUE) %>%
keepStandardChromosomes(pruning.mode = "coarse")
}
refseq_5kbp <- getRefSeq(5000)
chromswitch
on benchmark datasetHere we evaluate each feature matrix construction strategy on each type of input.
Summary strategy applied to H3K4me3
bench_s_k_raw <- mclapply(benchmark, safeCallSummary,
peaks = pk_k4me3,
metadata = meta,
mark = "H3K4me3",
filter = TRUE,
filter_columns = c("qValue", "signalValue"),
filter_thresholds = c(10, 4),
normalize = TRUE,
normalize_columns = c("qValue", "signalValue"),
summarize_columns = c("qValue", "signalValue"),
length = FALSE,
fraction = TRUE,
n = FALSE,
optimal_clusters = TRUE,
heatmap = FALSE,
BPPARAM = SerialParam(), mc.cores = n_cores) %>%
purrr::transpose()
bench_s_k <- bench_s_k_raw$result[map_lgl(bench_s_k_raw$error, is_null)] %>%
bind_rows()
save(bench_s_k, file = "computations/bench_s_k.RData")
Summary strategy applied to DNase I
bench_s_d_raw <- mclapply(benchmark, safeCallSummary,
peaks = pk_dnase,
metadata = meta,
mark = "DNase",
filter = FALSE,
normalize = FALSE,
summarize_columns = NULL,
length = FALSE,
fraction = TRUE,
n = FALSE,
optimal_clusters = TRUE,
heatmap = FALSE,
BPPARAM = SerialParam(),
mc.cores = n_cores) %>%
purrr::transpose()
bench_s_d <- bench_s_d_raw$result[map_lgl(bench_s_d_raw$error, is_null)] %>%
bind_rows()
save(bench_s_d, file = "computations/bench_s_d.RData")
Binary strategy applied to H3K4me3
bench_b_k_raw <- mclapply(benchmark,
safeCallBinary,
peaks = pk_k4me3,
metadata = meta,
filter = TRUE,
filter_columns = c("qValue","signalValue"),
filter_thresholds = c(10, 4),
reduce = TRUE,
gap = 300,
p = 0.4,
optimal_clusters = TRUE,
n_features = TRUE, mc.cores = n_cores) %>%
purrr::transpose()
bench_b_k <- bench_b_k_raw$result[map_lgl(bench_b_k_raw$error,
is_null)] %>%
bind_rows()
save(bench_b_k, file = "computations/bench_b_k.RData")
Binary strategy applied to DNase I
bench_b_d_raw <- mclapply(benchmark,
safeCallBinary,
peaks = pk_dnase,
metadata = meta,
filter = FALSE,
reduce = TRUE,
gap = 300,
p = 0.4,
optimal_clusters = TRUE,
n_features = TRUE, mc.cores = n_cores) %>%
purrr::transpose()
bench_b_d <- bench_b_d_raw$result[map_lgl(bench_b_d_raw$error, is_null)] %>%
bind_rows()
save(bench_b_d, file = "computations/bench_b_d.RData")
Summary strategy applied to ChromHMM
bench_s_c_raw <- mclapply(benchmark,
safeCallSummary,
peaks = chmm_states,
metadata = meta,
mark = "State",
filter = FALSE,
normalize = FALSE,
summarize_columns = NULL,
length = FALSE,
fraction = TRUE,
n = FALSE,
heatmap = FALSE,
optimal_clusters = TRUE,
BPPARAM = SerialParam(),
mc.cores = n_cores) %>%
purrr::transpose()
bench_s_c <- bench_s_c_raw$result[map_lgl(bench_s_c_raw$error, is_null)] %>%
bind_rows()
save(bench_s_c, file = "computations/bench_s_c.RData")
Binary strategy applied to ChromHMM
bench_b_c_raw <- mclapply(benchmark,
safeCallBinary,
peaks = chmm_states,
metadata = meta,
filter = FALSE,
reduce = FALSE,
p = 0.4,
optimal_clusters = TRUE,
n_features = TRUE,
mc.cores = n_cores) %>%
purrr::transpose()
bench_b_c <- bench_b_c_raw$result[map_lgl(bench_b_c_raw$error,
is_null)] %>%
bind_rows()
save(bench_b_c, file = "computations/bench_b_c.RData")
load("computations/bench_s_k.RData")
load("computations/bench_s_d.RData")
load("computations/bench_s_c.RData")
load("computations/bench_b_k.RData")
load("computations/bench_b_d.RData")
load("computations/bench_b_c.RData")
We compare with a random predictor by drawing from from a \(Uniform(0, 1)\) distribution, repeating this 10,000 times, and then averaging the results to assign a score. This is done for each region. It should track with the diagonal and have an AUROC of around 0.5
#' @param k, the number of query regions to make a prediction for
#' @param N, the number of times to draw
predictRandom <- function(k, N) {
rep(k, N) %>%
mclapply(runif, mc.cores = 2) %>%
data.frame() %>%
rowMeans()
}
# Repeat twice, to get one set of random predictions per method
draws <- replicate(2, predictRandom(120, 10000), simplify = FALSE)
save(draws, file = "computations/draws.RData")
load("computations/draws.RData")
bench_s_k <- bench_s_k %>%
add_column(Random = draws[[1]], .after = "Consensus")
bench_b_d <- bench_b_d %>%
add_column(Random = draws[[2]], .after = "Consensus")
Our method predicts chromatin state switches by computing a score of the similarity between the known biological condition labels and inferred clusters. We can assess how these scores look when we shuffle the class labels. We cluster samples based on the feature matrix as usual, but before calculating any of the external validity indices, we randomly shuffle the labels.
predictShuffle <- function(clusters, N) {
shuffle <- function(labels, clusters) {
shuf <- sample(labels)
contingency <- table(shuf, clusters)
data.frame(
ARI_shuffle = mclust::adjustedRandIndex(clusters, shuf),
NMI_shuffle = chromswitch:::NMI(clusters, shuf),
V_measure_shuffle = chromswitch:::vMeasure(contingency)
)
}
# Repeat N times
replicate(N, shuffle(labels = meta$Condition, clust = unlist(clusters)),
simplify = FALSE) %>%
bind_rows() %>%
colMeans()
}
shuffle_s_k <- bench_s_k %>%
dplyr::select(matches("^E[0-9]{3}")) %>%
t() %>%
as.data.frame %>%
mclapply(predictShuffle, N = 10000, mc.cores = n_cores) %>%
as.data.frame() %>%
t() %>%
as.data.frame()
shuffle_s_d <- bench_s_d %>%
dplyr::select(matches("^E[0-9]{3}")) %>%
t() %>%
as.data.frame %>%
mclapply(predictShuffle, N = 10000, mc.cores = n_cores) %>%
as.data.frame() %>%
t() %>%
as.data.frame()
shuffle_s_c <- bench_s_c %>%
dplyr::select(matches("^E[0-9]{3}")) %>%
t() %>%
as.data.frame %>%
mclapply(predictShuffle, N = 10000, mc.cores = n_cores) %>%
as.data.frame() %>%
t() %>%
as.data.frame()
save(shuffle_s_k, shuffle_s_d, shuffle_s_c,
file = "computations/shuffle_s.RData")
shuffle_b_k <- bench_b_k %>%
dplyr::select(matches("^E[0-9]{3}")) %>%
t() %>%
as.data.frame %>%
mclapply(predictShuffle, N = 10000, mc.cores = n_cores) %>%
as.data.frame() %>%
t() %>%
as.data.frame()
shuffle_b_d <- bench_b_d %>%
dplyr::select(matches("^E[0-9]{3}")) %>%
t() %>%
as.data.frame %>%
mclapply(predictShuffle, N = 10000, mc.cores = n_cores) %>%
as.data.frame() %>%
t() %>%
as.data.frame()
shuffle_b_c <- bench_b_c %>%
dplyr::select(matches("^E[0-9]{3}")) %>%
t() %>%
as.data.frame %>%
mclapply(predictShuffle, N = 10000, mc.cores = n_cores) %>%
as.data.frame() %>%
t() %>%
as.data.frame()
save(shuffle_b_c, shuffle_b_d, shuffle_b_k,
file = "computations/shuffle_b.RData")
load("computations/shuffle_s.RData")
load("computations/shuffle_b.RData")
addShuf <- function(df) {
df %>%
mutate(Consensus_shuffle =
rowMeans(dplyr::select(., ARI_shuffle, NMI_shuffle, V_measure_shuffle)))
}
bench_s_k <- bind_cols(bench_s_k, shuffle_s_k) %>% addShuf()
bench_s_d <- bind_cols(bench_s_d, shuffle_s_d) %>% addShuf()
bench_s_c <- bind_cols(bench_s_c, shuffle_s_c) %>% addShuf()
bench_b_k <- bind_cols(bench_b_k, shuffle_b_k) %>% addShuf()
bench_b_d <- bind_cols(bench_b_d, shuffle_b_d) %>% addShuf()
bench_b_c <- bind_cols(bench_b_c, shuffle_b_c) %>% addShuf()
The problem of a detecting a chromatin state switch in a region can be reframed as the problem of classifying the region as containing a switch between conditions or not. Therefore, our method can be regarded as a binary classifier, and evaluated accordingly. We evaluate the performance of our method on the benchmark set by computing Receiver-Operating Characteristic (ROC) curves based on the consensus score computed by chromswitch
. ROC curves measure the tradeoff between the true positive rate and the false positive rate as the threshold on the consensus score is varied, as well as the area under the ROC curve (AUROC) which takes on a value of 1 for a perfect classifier and can be interpreted as the probability that chromswitch
will score a randomly chosen region corresponding to a chromatin state switch higher than a randomly chosen region which does not exhibit a switch.
calcROC <- function(scores, method, input, smooth) {
# Compute ROC info using each score as a predictor
roc_out <- scores %>%
dplyr::select(
matches("Consensus|Random")) %>%
map(~ pROC::roc(response = scores$switch,
predictor = as.numeric(.),
smooth = smooth))
# Extract values to plot ROC curves
roc_val <- roc_out %>%
map(`[`, c("sensitivities", "specificities")) %>%
map(as.data.frame) %>%
map2_df(names(roc_out), function(df, index) mutate(df, index = index)) %>%
mutate(TPR = sensitivities,
FPR = 1 - specificities,
Method = method,
Input = input)
return(roc_val)
}
benchmark_roc <- bind_rows(
calcROC(bench_s_k, "Summary", "H3K4me3", smooth = TRUE),
calcROC(bench_s_d, "Summary", "DNase I", smooth = TRUE),
calcROC(bench_s_c, "Summary", "ChromHMM", smooth = TRUE),
calcROC(bench_b_k, "Binary", "H3K4me3", smooth = TRUE),
calcROC(bench_b_d, "Binary", "DNase I", smooth = TRUE),
calcROC(bench_b_c, "Binary", "ChromHMM", smooth = TRUE)) %>%
filter(grepl("Consensus|Random", index)) %>%
filter(index != "Random" |
(index == "Random") & (Method == "Binary") & (Input == "DNase I")) %>%
mutate(Input = ifelse(index == "Random", "NA", Input)) %T>%
write_tsv("tables/benchmark_roc.tsv")
roc_gg <- benchmark_roc %>%
makeLabel()
Figure 1b
roc_gg %>%
filter(Input != "DNase I") %>%
filter(index != "Consensus_shuffle") %>%
ggplot(aes(x = FPR, y = TPR, colour = Label, linetype = Label)) +
geom_line(size = 1) +
scale_colour_manual(name = "Strategy", values = pal_strategy) +
scale_linetype_manual(name = "Strategy", values = linetype_strategy) +
theme_min() +
xlab("False Positive Rate") + ylab("True Positive Rate") +
theme(legend.position = "bottom") +
labs(colour = "") + labs(linetype = "") +
guides(colour = guide_legend(ncol = 1))
Figure S2
roc_gg %>%
filter(index != "Random") %>%
ggplot(aes(x = FPR, y = TPR, colour = Label, linetype = Label)) +
geom_line(size = 1) +
scale_colour_manual(name = "Strategy", values = pal_strategy) +
scale_linetype_manual(name = "Strategy", values = linetype_strategy2) +
theme_min() +
# theme(legend.position = "bottom") +
xlab("False Positive Rate") + ylab("True Positive Rate") +
theme(legend.position = "bottom") +
facet_wrap(~ Input) +
labs(colour = "") + labs(linetype = "") +
guides(colour = guide_legend(ncol = 1))
calcAUC <- function(scores, method, input) {
scores %>%
dplyr::select(
matches("Consensus|Random")) %>%
map(~pROC::roc(response = scores$switch, predictor = as.numeric(.))) %>%
map_df("auc") %>%
mutate(Method = method,
Input = input)
}
benchmark_auc <- bind_rows(
calcAUC(bench_s_k, "Summary", "H3K4me3"),
calcAUC(bench_s_d, "Summary", "DNase I"),
calcAUC(bench_s_c, "Summary", "ChromHMM"),
calcAUC(bench_b_k, "Binary", "H3K4me3"),
calcAUC(bench_b_d, "Binary", "DNase I"),
calcAUC(bench_b_c, "Binary", "ChromHMM")) %>%
select(Method, Input, Consensus, Consensus_shuffle, Random)
benchmark_auc
Figure S2
# A little manipulation to make the table for the supplementary
benchmark_auc %>%
gather(Stat, Score, Consensus, Consensus_shuffle) %>%
rowwise() %>%
mutate(Method = ifelse(grepl("shuffle", Stat),
paste0(Method, " (shuffled)"), Method)) %>%
ungroup() %>%
dplyr::select(Method, Input, Score) %>%
spread(Input, Score) %T>%
write_tsv("tables/benchmark_auc.tsv") %>%
xtable() # Generate latex code
There are two parameters used in the feature construction process for the binary approach, gap
which controls whether nearby peaks within samples are connected, and p
which controls how much overlap is required to call two peaks the same. To assess the impact of these parameters, we’ll perform a grid search on various combinations.
Since we want to investigate the effect of varying the tuning parameters, we select the best threshold on the Consensus score on this benchmarking dataset:
getBest <- function(bench_out) {
best_consensus <- pROC::roc(predictor = bench_out$Consensus,
response = bench_out$switch) %>%
pROC::coords("best",
ret = c("threshold", "sensitivity",
"specificity", "accuracy", "ppv",
as.list = TRUE, drop = TRUE))
best_threshold <- best_consensus["threshold"]
return(best_threshold)
}
best_k <- getBest(bench_b_k)
best_k
## threshold
## 0.2156931
grid <- expand.grid(gap = seq(200, 500, by = 50),
p_overlap = seq(0.2, 0.8, by = 0.1))
evaluateParams <- function(gap, p, best_threshold, ...) {
out <- benchmark %>%
mclapply(safeCallBinary, gap = gap, p = p, ...) %>%
purrr::transpose()
ok <- map_lgl(out$error, is_null)
out <- out$result[ok] %>%
bind_rows() %>%
mutate(pred = ifelse(Consensus >= best_threshold, 1, 0))
# Now we can tally the results by comparing the predictions
# to the known classes (switch or not)
stats <- table(benchmark$switch[ok], out$pred)
tn = fn = tp = fp = 0
try(tn <- stats["0", "0"], silent = TRUE)
try(fn <- stats["1", "0"], silent = TRUE)
try(tp <- stats["1", "1"], silent = TRUE)
try(fp <- stats["0", "1"], silent = TRUE)
df <- data.frame(gap, p, tp, tn, fp, fn)
return(list(predictions = out$pred, stats = df))
}
safeEvalParams <- purrr::safely(evaluateParams)
grid_out <- Map(function(gap, p) safeEvalParams(gap, p,
best_threshold = best_k,
peaks = pk_k4me3,
metadata = meta,
filter = TRUE,
filter_columns = c("qValue", "signalValue"),
filter_thresholds = c(10, 4),
reduce = TRUE,
heatmap = FALSE,
BPPARAM = SerialParam(),
optimal_clusters = TRUE,
mc.cores = n_cores),
grid$gap, grid$p_overlap) %>%
purrr::transpose()
save(grid_out, file = "computations/grid.RData")
Figure S4
load("computations/grid.RData")
makeGrid(grid_out, "H3K4me3")
load("tables/grid_scores.H3K4me3.RData")
plotGrid(grid_scores)
Next, we investigate how sensitive or robust chromswitch
is to small numbers of samples and/or to datasets with high class imbalance (i.e. when samples in one condition greatly outnumber samples in the other). To do so, we will subsample from the samples in the full 23-tissue dataset to create many smaller datasets with fewer samples and greater class imbalance, and evaluate how chromswitch
performs in classifying the benchmark regions using these smaller datasets.
For the summary feature matrix construction strategy, the computation of the feature vector for each sample is independent of all other samples, so we can save the feature matrices and subsample rows from the matrices. Here we compute and save the feature matrix for each region:
getSummaryFeatures <- function(query) {
cols <- c("qValue", "signalValue")
ft_mat <- pk_k4me3 %>%
filterPeaks(columns = cols,
thresholds = c(10, 4)) %>%
normalizePeaks(columns = cols) %>%
retrievePeaks(metadata = meta, region = query) %>%
summarizePeaks(mark = "H3K4me3", cols = cols,
length = FALSE, fraction = TRUE, n = FALSE)
return(ft_mat)
}
safeGetFeatures <- purrr::safely(getSummaryFeatures)
ft_s_k_raw <- benchmark %>%
mclapply(safeGetFeatures, mc.cores = n_cores) %>%
purrr::transpose()
ft_s_k <- ft_s_k_raw$result
save(ft_s_k, ft_s_k_raw,
file = "computations/benchmark_features_s_k.RData")
For the binary strategy, the feature matrices are derived from the union of peaks supplied across samples, and are therefore dependent on the exact set of samples, so we will only save the local peaks (the peaks in the query region) for each sample.
getBinaryPeaks <- function(query) {
pks <- pk_k4me3 %>%
retrievePeaks(metadata = meta, region = query) %>%
reducePeaks(gap = 300)
return(pks)
}
safeGetBinaryPeaks <- purrr::safely(getBinaryPeaks)
pk_b_k_raw <- benchmark %>%
mclapply(safeGetBinaryPeaks, mc.cores = n_cores) %>%
purrr::transpose()
pk_b_k <- pk_b_k_raw$result
# Extract the peaks from the localPeaks object
lpk_b_k <- lapply(pk_b_k, chromswitch:::lpkPeaks)
save(pk_b_k_raw, pk_b_k, lpk_b_k,
file = "computations/benchmark_peaks_b_k.RData")
load("computations/benchmark_features_s_k.RData")
load("computations/benchmark_peaks_b_k.RData")
These are the different datasets we’ll construct, each with a different number and proportion of brain & other samples.
combos <- data.frame(
n_other = c(2, 3, 4, 5, 6, 7, 8, 10, 12, 14, 16, 16, 16, 10, 12, 15, 16, 16),
n_brain = c(2, 3, 4, 5, 6, 7, 7, 7, 7, 7, 7, 2, 3, 5, 4, 3, 4, 5)
)
combos
lvls_combos <- c(unlist(Map(function(x, y)
paste0(x ," brain vs. ", y, " other"),
combos$n_brain, combos$n_other)))
idx_other <- which(meta$Condition == "Other")
idx_brain <- which(meta$Condition == "Brain")
Since in some cases it’s possible that there are few enough combinations that we can exhaustively try all of them, we will do so when there are fewer than 100 combinations. Otherwise, we will evaluate 100 randomly subsample ddatasets to for each combination.
How many possible datasets exist for each pair of n_brain
and n_other
?
n_pos <- function(n_other, n_brain, idx_other) choose(length(idx_other), n_other) * choose(length(idx_brain), n_brain)
combos$n_pos <- unlist(Map(function(n_other, n_brain) n_pos(n_other, n_brain, idx_other),
combos$n_other, combos$n_brain))
combos
For the code below, the functions that carry out the subsampling experiment are sourced from functions/subsampling_functions.R
.
safeSubsampleFt <- purrr::safely(subsampleFromFeatures)
safeSubsamplePk <- purrr::safely(subsampleFromPeaks)
subsamp_s_raw <- mcMap(function(n_other, n_brain)
safeSubsampleFt(ft_s_k, n_other, n_brain, idx_other),
combos$n_other, combos$n_brain,
mc.cores = n_cores) %>%
purrr::transpose()
subsamp_s <- subsamp_s_raw$result
save(subsamp_s, subsamp_s_raw, file = "computations/subsamp_s.RData")
subsamp_b_raw <- mcMap(function(n_other, n_brain)
safeSubsamplePk(pk_b_k, n_other, n_brain, idx_other),
combos$n_other, combos$n_brain,
mc.cores = n_cores) %>%
purrr::transpose()
subsamp_b <- subsamp_b_raw$result
save(subsamp_b, subsamp_b_raw, file = "computations/subsamp_b.RData")
Now we can process the results, and calculate mean AUC scores for each combination so we can quantify performance.
load("computations/subsamp_s.RData")
load("computations/subsamp_b.RData")
auc_s <- subsamp_s %>% map_df("auc") %>%
mutate(Method = "Summary", Input = "H3K4me3")
auc_b <- subsamp_b %>% map("scores") %>%
map_df("auc") %>%
mutate(Method = "Binary", Input = "H3K4me3")
# Simplify the facet labels
lvls_combos_new <- c(unlist(Map(function(x, y)
paste0(x ," vs. ", y),
combos$n_brain, combos$n_other)))
lvls_combos_new[1] <- "2 brain vs. 2 other"
# Calculate the mean AUC for each combination of n_brain, n_other
mean_auc <- bind_rows(auc_s, auc_b) %>%
rowwise() %>%
mutate(dataset = ifelse(n_brain == 2 && n_other == 2,
"2 brain vs. 2 other",
paste0(n_brain, " vs. ", n_other))) %>%
ungroup() %>%
mutate(dataset = factor(dataset, levels = lvls_combos_new)) %>%
dplyr::select(dataset, iteration, AUC = Consensus, Method, Input) %>%
group_by(dataset, Method, Input) %>%
summarise(mean_AUC = paste0("AUC: ", round(mean(AUC), 2))) %T>%
write_tsv("tables/subsampling_auc.tsv")
roc_s <- subsamp_s %>% map_df("roc") %>%
mutate(Method = "Summary", Input = "H3K4me3")
roc_b <- subsamp_b %>% map("scores") %>%
map_df("roc") %>%
mutate(Method = "Binary", Input = "H3K4me3")
roc_df <- bind_rows(roc_s, roc_b) %>%
rowwise() %>%
mutate(dataset = ifelse(n_brain == 2 && n_other == 2,
"2 brain vs. 2 other",
paste0(n_brain, " vs. ", n_other))) %>%
ungroup() %>%
mutate(dataset = factor(dataset, levels = lvls_combos_new)) %>%
filter(index == "Consensus") %>%
dplyr::select(-index)
# Ensures that the iterations are plot in an order that looks nice
# with the colours of the lines
roc_df$iteration <- factor(roc_df$iteration,
levels = sort(unique(roc_df$iteration),
decreasing = TRUE))
In the figures below, the ROC curve for every individual subsample dataset is plot in a shade of gray, while the mean across iterations, for each combination, is plot in red (summary strategy) or blue (binary strategy).
Figure S3a
roc_df %T>%
write_tsv("tables/subsampling_roc.tsv") %>%
filter(Method == "Summary" & Input == "H3K4me3") %>%
plotSubsamp("red", "Summary", "H3K4me3")
Figure S3b
roc_df %>%
filter(Method == "Binary" & Input == "H3K4me3") %>%
plotSubsamp("blue", "Binary", "H3K4me3")
Since the chromswitch
algorithm depends on the data in the genomic query window, which is selected by the user, we next considered the effect of variations in the window size. We studied how the consensus score changed as a function of the window size for three genes in our benchmark dataset, using H3K4me3 peaks.
window_genes <- c("GRIK2", "CHRM1", "TTYH1")
windows <- refseq_5kbp %>%
as.data.frame %>%
filter(gene_name %in% window_genes) %>%
select(-start, -end)
get_window <- function(window_size, genes) {
d <- window_size / 2
genes %>%
# Define window around the TSS
mutate(start = case_when(
strand == "+" ~ as.numeric(txStart) - d,
strand == "-" ~ as.numeric(txEnd) - d)) %>%
mutate(end = case_when(
strand == "+" ~ as.numeric(txStart) + d,
strand == "-" ~ as.numeric(txEnd) + d)) %>%
mutate(gene_name = paste0(gene_name, "_", window_size/1000, "kbp"))
}
# Window sizes to test, in base pairs
sizes <- c(seq(1000, 10000, by = 1000),
seq(10000, 50000, by = 10000),
100000, 250000, 500000)
windows_gr <- lapply(sizes, get_window, windows) %>% bind_rows() %>%
makeGRangesFromDataFrame(keep.extra.columns = TRUE)
win_s <- callSummary(windows_gr,
peaks = pk_k4me3,
metadata = meta,
mark = "H3K4me3",
filter = TRUE,
filter_columns = c("qValue", "signalValue"),
filter_thresholds = c(10, 4),
normalize = TRUE,
normalize_columns = c("qValue", "signalValue"),
summarize_columns = c("qValue", "signalValue"),
length = FALSE,
fraction = TRUE,
n = FALSE,
optimal_clusters = TRUE,
heatmap = TRUE,
titles = paste0(windows_gr$gene_name, "_summary"),
outdir = "figures/heatmaps")
win_s
save(win_s, file = "computations/windowsize_s.RData")
win_b <- callBinary(query = win_to_test,
peaks = pk_k4me3,
metadata = meta,
filter = TRUE,
filter_columns = c("qValue","signalValue"),
filter_thresholds = c(10, 4),
reduce = TRUE,
gap = 300,
p = 0.4,
optimal_clusters = TRUE,
heatmap = TRUE,
titles = paste0(windows_gr$gene_name, "_binary"),
outdir = "figures/heatmaps",
n_features = TRUE)
win_b
save(win_b, file = "computations/windowsize_b.RData")
Figure S5
load("computations/windowsize_s.RData")
load("computations/windowsize_b.RData")
windows <- bind_rows(mutate(win_s, method = "Summary"), mutate(win_b, method = "Binary")) %>%
select(gene_name, Consensus, method) %>%
separate(gene_name, into = c("gene_name", "window_size"), sep = "_") %>%
separate(window_size, into = c("window_size", "drop"), sep = "k") %>%
select(-drop) %>%
mutate(window_size = as.numeric(window_size)) %T>%
write_tsv("tables/window_size.tsv")
# Plot TTYH1, "zoomed" in in the 0-50kbp range and in the 0-500kbp range
ttyh1_short <- windows %>%
filter(gene_name == "TTYH1") %>%
filter(window_size <= 50) %>%
mutate(range = "50kbp")
ttyh1_long <- windows %>%
filter(gene_name == "TTYH1") %>%
filter(window_size <= 500) %>%
mutate(range = "500kbp")
bind_rows(ttyh1_short, ttyh1_long) %>%
mutate(range = factor(range, levels = c("50kbp", "500kbp"))) %>%
ggplot(aes(x = window_size, y = Consensus)) +
geom_point(aes(colour = method)) +
geom_line(aes(colour = method), alpha = 0.8, size = 0.5) +
scale_colour_manual(name = "Method", values = pal_methods) +
xlab("Window Size (kbp)") + ylab("Consensus score") +
facet_wrap(~ range, ncol = 1, scales = "free_x") +
ggmin::theme_min() +
theme(legend.position = c(0.9, 0.33)) +
theme(strip.background = element_blank(), strip.text.x = element_blank())
This package was initially designed to facilitate the task of identifying epigenetic changes for challenging datasets with abnormal patterns of enrichment of certain histone modifications, like H3K27me3. To analyze how chromswitch
performs for a mark with a broad profile of enrichment, we will investigate a few regions where we expect a difference in K27me3 enrichment between human embryonic stem cells (ESC) and differentiated adult tissues.
We use ESC samples from the Roadmap Epigenomics Project, compared to the same adult tissues we used above.
esc_meta <- meta %>%
filter(Condition == "Other") %>%
mutate(Condition = "Adult") %>%
bind_rows(read_tsv("input/esc_metadata.tsv"))
## Parsed with column specification:
## cols(
## Sample = col_character(),
## Tissue = col_character(),
## Condition = col_character()
## )
esc_meta
k27m_regions <- c(GRanges("chr7:27122134-27298855"), # HOX
GRanges("chr16:54125741-57125741"), # IRX3 3 Mbp region
GRanges("chr16:54313663-54328454")) # IRX3
mcols(k27m_regions)$id <- c("HOX", "IRX3_3Mbp", "IRX3")
Download narrow peaks for H3K27me3:
urls_k27 <- paste0("http://egg2.wustl.edu/roadmap/data/byFileType/peaks/consolidated/narrowPeak/",
esc_meta$Sample, "-H3K27me3.narrowPeak.gz")
basenames_k27 <- lapply(urls_k27, basename) %>% unlist()
destpaths_k27 <- paste0("data/H3K27me3_peaks/", basenames_k27)
dl_k27 <- mcMap(noClobberDownload, urls_k27, destpaths_k27, mc.cores = n_cores)
extra_cols <- c("name" = "character", "score" = "integer", "strand" = "character",
"signalValue" = "numeric", "pValue" = "numeric", "qValue" = "numeric",
"peak" = "numeric")
pk_k27me3 <- lapply(destpaths_k27, rtracklayer::import, format = "bed",
extraCols = extra_cols)
names(pk_k27me3) <- esc_meta$Sample
save(pk_k27me3, file = "computations/pk_k27me3.RData")
chromswitch
using H3K27me3 peaksk27m_s <- callSummary(k27m_regions,
peaks = pk_k27me3,
metadata = esc_meta,
mark = "H3K27me3",
filter = TRUE,
filter_columns = c("qValue", "signalValue"),
filter_thresholds = c(10, 10),
normalize = TRUE,
normalize_columns = c("qValue", "signalValue"),
summarize_columns = c("qValue", "signalValue"),
length = FALSE,
fraction = TRUE,
n = FALSE,
optimal_clusters = TRUE,
heatmap = TRUE,
outdir = "figures/heatmaps",
titles = paste0(k27m_regions$id, "_summary"))
save(k27m_s, file = "computations/h3k27me3_summary.RData")
k27m_b <- callBinary(query = k27m_regions,
peaks = pk_k27me3,
metadata = esc_meta,
filter = TRUE,
filter_columns = c("qValue","signalValue"),
filter_thresholds = c(10, 10),
reduce = TRUE,
gap = 300,
p = 0.4,
optimal_clusters = TRUE,
heatmap = TRUE,
n_features = TRUE,
outdir = "figures/heatmaps",
titles = paste0(k27m_regions$id, "_binary"))
save(k27m_b, file = "computations/h3k27me3_binary.RData")
Presented in Figures S6-8
load("computations/h3k27me3_summary.RData")
load("computations/h3k27me3_binary.RData")
# By the summary strategy
k27m_s %>%
mutate(Method = "Summary") %>%
select(region, id, k, Average_Silhouette, Consensus, Method)
# By the binary strategy
k27m_b %>%
mutate(Method = "Binary") %>%
select(region, id, k, Average_Silhouette, Consensus, Method)
To validate chromswitch
, we will evaluate whether it’s capable of detecting chromatin state switches that result in brain-specific expression. We start by running chromswitch
genome-wide to 5kb windows around TSS. Chromswitch
makes a simple assessment of how chromatin states inferred from the data associate with the biological conditions of the samples, but we will need a more refined set of candidates to reasonably expect an association with gene expression data. Therefore, as candidates for validation, we select genes where the chromswitch
finds an open chromatin state in brain tissues and closed chromatin in other tissues, evaluate whether these correspond to brain-specific expression.
We downloaded data from the Genotype-Tissue Expression Project database. The file data/GTEx_Analysis_v6p_RNA-seq_RNA-SeQCv1.1.8_gene_median_rpkm.gct
was downloaded from the RNA-seq section at this link: https://gtexportal.org/home/datasets and contains the mean RPKM per tissue type per gene.
First we assign more concise names to the tissues and filter to a set which is comparable to our epigenomic dataset.
keep_tissue <- c("Amygdala", "Anterior cingulate cortex", "Basal ganglia",
"Cerebella hemisphere", "Cerebellum", "Cortex",
"Frontal cortex",
"Hippocampus", "Hypothalamus", "Substantia nigra", "Aorta",
"Sigmoid colon", "Transverse colon", "Esophagus",
"Heart (atrial appendage)", "Heart (left ventricle)",
"Kidney", "Liver", "Lung", "Skeletal muscle",
"Pancreas", "Small intestine", "Stomach")
brain <- c("Amygdala", "Anterior cingulate cortex", "Basal ganglia",
"Cerebella hemisphere", "Cerebellum", "Cortex", "Frontal cortex",
"Hippocampus", "Hypothalamus", "Substantia nigra")
other <- keep_tissue[!(keep_tissue %in% brain)]
gtex_rpkm <- read_tsv(
"data/GTEx_Analysis_v6p_RNA-seq_RNA-SeQCv1.1.8_gene_median_rpkm.gct",
skip = 2) %>%
dplyr::rename(gene_name = Description) %>%
unite(col = "unite", sep = "=", Name, gene_name) %>%
gather(key = "Tissue", value = "mean_RPKM", 2:length(.)) %>%
mutate(Tissue = recode(.x = Tissue,
`Brain - Amygdala` = "Amygdala",
`Brain - Anterior cingulate cortex (BA24)` = "Anterior cingulate cortex",
`Brain - Caudate (basal ganglia)` = "Basal ganglia",
`Brain - Cerebellar Hemisphere` = "Cerebellar hemisphere",
`Brain - Cerebellum` = "Cerebellum",
`Brain - Cortex` = "Cortex",
`Brain - Frontal Cortex (BA9)` = "Frontal cortex",
`Brain - Hippocampus` = "Hippocampus",
`Brain - Hypothalamus` = "Hypothalamus",
`Brain - Nucleus accumbens (basal ganglia)` = "Nucleus accumbens",
`Brain - Putamen (basal ganglia)` = "Putamen",
`Brain - Substantia nigra` = "Substantia nigra",
`Adipose - Subcutaneous` = "Subcutaneous adipose",
`Adipose - Visceral (Omentum)` = "Omentum",
`Artery - Aorta` = "Aorta",
`Artery - Coronary` = "Coronary artery",
`Artery - Tibial` = "Tibial artery",
`Bladder` = "Bladder",
`Breast - Mammary Tissue` = "Breast",
`Cervix - Ectocervix` = "Ectocervix",
`Cervix - Endocervix` = "Endocervix",
`Colon - Sigmoid` = "Sigmoid colon",
`Colon - Transverse` = "Transverse colon",
`Esophagus - Gastroesophageal Junction` = "Esophagus (GEJ)",
`Esophagus - Mucosa` = "Esophageal mucosa",
`Esophagus - Muscularis` = "Esophagus (Muscularis)",
`Fallopian Tube` = "Fallopian tube",
`Heart - Atrial Appendage` = "Heart (atrial appendage)",
`Heart - Left Ventricle` = "Heart (left ventricle)",
`Kidney - Cortex` = "Kidney",
`Liver` = "Liver",
`Lung` = "Lung",
`Muscle - Skeletal` = "Skeletal muscle",
`Pancreas` = "Pancreas",
`Prostate` = "Prostate",
`Skin - Nut Sun Exposed (Suprapubic)` = "Skin",
`Small Intestine - Terminal Ileum` = "Small intestine",
`Spleen` = "Spleen",
`Stomach` = "Stomach",
`Thyroid` = "Thyroid",
`Uterus` = "Uterus",
`Vagina` = "Vagina", .default = "drop")) %>%
filter(Tissue %in% keep_tissue) %>%
spread(Tissue, mean_RPKM) %>%
separate(unite, into = c("Name", "gene_name"), sep = "=")
## Parsed with column specification:
## cols(
## .default = col_double(),
## Name = col_character(),
## Description = col_character()
## )
## See spec(...) for full column specifications.
Calculate the log2 fold change of expression between brain tissues and other tissues and the dataset.
gtex_fc <- gtex_rpkm
# For each gene, calculate the median RPKM across the tissues in each group
gtex_fc$median_brain_RPKM <- apply(select_(gtex_rpkm, brain), 1, median)
gtex_fc$median_other_RPKM <- apply(select_(gtex_rpkm, other), 1, median)
# Calculate log2 FC with a pseudocount
gtex_fc <- gtex_fc %>%
mutate(FC = (median_brain_RPKM + 1) / (median_other_RPKM + 1),
log_FC = log(FC, base = 2))
chromswitch
genome wideWe evaluate the summary strategy on each of the three input types.
H3K4me3:
val_s_k_raw <- refseq_5kbp %>%
mclapply(safeCallSummary,
peaks = pk_k4me3,
metadata = meta,
mark = "H3K4me3",
filter = TRUE,
filter_columns = c("qValue", "signalValue"),
filter_thresholds = c(10, 4),
normalize = TRUE,
normalize_columns = c("qValue", "signalValue"),
summarize_columns = c("qValue", "signalValue"),
length = FALSE,
fraction = TRUE,
n = FALSE,
heatmap = FALSE,
estimate_state = TRUE,
signal_col = "signalValue",
test_condition = "Brain",
optimal_clusters = TRUE,
BPPARAM = SerialParam(),
mc.cores = n_cores) %>%
purrr::transpose()
val_s_k <- val_s_k_raw$result[map_lgl(val_s_k_raw$error, is_null)] %>%
bind_rows()
save(val_s_k, val_s_k_raw, file = "computations/validation_s_k.RData")
ChromHMM:
val_s_c_raw <- refseq_5kbp %>%
mclapply(safeCallSummary,
peaks = chmm_states,
metadata = meta,
mark = "State",
filter = FALSE,
normalize = FALSE,
summarize_columns = NULL,
length = FALSE,
fraction = TRUE,
n = FALSE,
heatmap = FALSE,
optimal_clusters = TRUE,
estimate_state = TRUE,
signal_col = "fraction",
test_condition = "Brain",
BPPARAM = SerialParam(),
mc.cores = n_cores) %>%
purrr::transpose()
val_s_c <- val_s_c_raw$result[map_lgl(val_s_c_raw$error, is_null)] %>%
bind_rows()
save(val_s_c_raw, val_s_c, file = "computations/validation_s_c.RData")
DNase I:
val_s_d_raw <- refseq_5kbp %>%
mclapply(safeCallSummary,
peaks = pk_dnase,
metadata = meta,
mark = "DNase",
filter = FALSE,
normalize = FALSE,
summarize_columns = NULL,
length = FALSE,
fraction = TRUE,
n = FALSE,
heatmap = FALSE,
estimate_state = TRUE,
signal_col = "fraction",
test_condition = "Brain",
optimal_clusters = TRUE,
BPPARAM = SerialParam(), # Parallelize outside
mc.cores = n_cores) %>%
purrr::transpose()
val_s_d <- val_s_d_raw$result[map_lgl(val_s_d_raw$error, is_null)] %>%
bind_rows()
save(val_s_d_raw, val_s_d, file = "computations/validation_s_d.RData")
H3K4me3:
val_b_h_raw <- refseq_5kbp %>%
mclapply(safeCallBinary,
peaks = pk_k4me3,
metadata = meta,
filter = TRUE,
filter_columns = c("qValue", "signalValue"),
filter_thresholds = c(10, 4),
reduce = TRUE,
gap = 300,
p = 0.4,
heatmap = FALSE,
n_features = TRUE,
estimate_state = TRUE,
test_condition = "Brain",
optimal_clusters = TRUE,
BPPARAM = SerialParam(),
mc.cores = n_cores) %>%
purrr::transpose()
val_b_h <- val_b_h_raw$result[map_lgl(val_b_h_raw$error, is_null)] %>%
bind_rows()
save(val_b_h, val_b_h_raw, file = "computations/validation_b_h.RData")
DNase I:
val_b_d_raw <- refseq_5kbp[1] %>%
mclapply(safeCallBinary,
peaks = pk_dnase,
metadata = meta,
filter = FALSE,
reduce = TRUE,
gap = 300,
p = 0.4,
heatmap = FALSE,
n_features = TRUE,
estimate_state = TRUE,
test_condition = "Brain",
optimal_clusters = TRUE,
BPPARAM = SerialParam(),
mc.cores = n_cores) %>%
purrr::transpose()
val_b_d <- val_b_d_raw$result[map_lgl(val_b_d_raw$error, is_null)] %>%
bind_rows()
save(val_b_d, val_b_d_raw, file = "computations/validation_b_d.RData")
ChromHMM:
val_b_c_raw <- refseq_5kbp %>%
mclapply(safeCallBinary,
peaks = chmm_states,
metadata = meta,
filter = FALSE,
reduce = FALSE,
gap = 300,
p = 0.4,
heatmap = FALSE,
n_features = TRUE,
estimate_state = TRUE,
test_condition = "Brain",
optimal_clusters = TRUE,
BPPARAM = SerialParam(),
mc.cores = n_cores) %>%
purrr::transpose()
val_b_c <- val_b_c_raw$result[map_lgl(val_b_c_raw$error, is_null)] %>%
bind_rows()
save(val_b_c, val_b_c_raw, file = "computations/validation_b_c.RData")
load("computations/validation_s_k.RData")
load("computations/validation_s_d.RData")
load("computations/validation_s_c.RData")
load("computations/validation_b_h.RData")
val_b_k <- val_b_h # Rename this object for consistency
load("computations/validation_b_d.RData")
load("computations/validation_b_c.RData")
First, we will plot the mean RPKM across candidate genes and assess how the distributions differ betwen brain and other tissues.
Define a function to get the candidate switches for each threshold:
joinGtex <- function(val_df) {
val_df %>%
dplyr::select(gene_name, Consensus, state, k) %>%
inner_join(gtex_fc, by = "gene_name")
}
thresholdScores <- function(threshold, scores) {
scores %>%
dplyr::filter(Consensus >= threshold) %>%
mutate(Threshold = threshold)
}
getCandidates <- function(df, method, input) {
thresholds <- c(0.25, 0.5, 0.75)
df <- df %>% filter(state == "ON")
df_thresholded <- map_df(thresholds, thresholdScores, df) %>%
mutate(Threshold = paste0("Consensus >= ", Threshold),
Method = method,
Input = input)
}
val_out <- list(val_s_k, val_s_d, val_s_c,
val_b_k, val_b_d, val_b_c)
eval_methods <- c(rep("Summary", 3), rep("Binary", 3))
eval_input <- rep(c("H3K4me3", "DNase I", "ChromHMM"), 2)
val_gtex <- val_out %>%
mclapply(joinGtex, mc.cores = 2)
val_candidates <- val_gtex %>%
{mcMap(getCandidates, df = ., method = eval_methods, input = eval_input,
mc.cores = 2)} %>%
bind_rows()
val_candidates_long <- val_candidates %>%
gather(Tissue, mean_RPKM, 6:26) %>%
filter(Threshold == "Consensus >= 0.75")
# Count candidates
counts <- ddply(.data = val_candidates,
.(Method, Input, Threshold),
summarize,
# Define a string with the number of candidates which we can plot
n = paste0(unique(Input), "\n(n=", length(gene_name), ")")) %>%
filter(Threshold == "Consensus >= 0.75")
val_candidates_long_clean <- val_candidates_long %>%
mutate(Label = ifelse(Tissue %in% brain, "Brain", "Other")) %>%
mutate(Tissue = factor(Tissue, levels = keep_tissue)) %>%
makeLabelSimple() %T>%
write_tsv("tables/validation_boxplot.tsv")
Figure 1c
val_candidates_long_clean %>%
filter(Tissue %in% keep_tissue) %>%
filter(Input == "H3K4me3") %>%
mutate(Method = ifelse(Method == "Summary", "Summary (n=220)",
"Binary (n=125)")) %>%
ggplot(aes(x = Tissue, y = mean_RPKM)) +
geom_boxplot(width = 0.65, outlier.shape = NA, aes(fill = Method),
position = position_dodge(width = 0.8)) +
scale_fill_manual(values = c("Summary (n=220)" = "red",
"Binary (n=125)" = "#2f60af")) +
facet_wrap(~ Input, ncol = 1) +
ggmin::theme_min() +
coord_cartesian(ylim = c(0, 55)) +
ylab("Mean RPKM") +
theme(text = element_text(size = 10),
axis.text.x = element_text(angle = 90, hjust = 1),
axis.ticks.x = element_blank(),
legend.position = c(0.9, 0.7))
Figure S10
val_candidates_long_clean %>%
ggplot(aes(x = Tissue, y = mean_RPKM)) +
geom_boxplot(outlier.shape = NA, aes(fill = Label), width = 0.65) +
scale_fill_manual(values = pal_strategy) +
ggmin::theme_min() +
facet_grid(Input ~ Method) +
coord_cartesian(ylim = c(0, 55)) +
ylab("Mean RPKM") +
theme(text = element_text(size = 10),
axis.text.x = element_text(angle = 90, hjust = 1, size = 7),
strip.text.y = element_blank(),
legend.position = "none") +
geom_text(data = counts, aes(x = 19, y = 52, label = n), size = 3,
colour = "black", inherit.aes = FALSE, parse = FALSE)
Second, to quantify validation, we will assess how the difference in expression between brain and other tissues associates with the consensus score output by chromswitch
. To do so, we calculate the log2 fold change of the mean RPKM across tissues in each group, across candidate switches at each threshold on the consensus score.
thresholds <- seq(0, 1, by = 0.1)
getFC <- function(threshold, scores) {
scores %>%
dplyr::filter(Consensus >= threshold) %>%
`[[`("log_FC") %>%
mean()
}
val_fc <- data.frame(thresholds)
val_fc$Summary_H3K4me3 <- data.frame(thresholds) %>%
apply(FUN = getFC, MARGIN = 1, val_gtex[[1]])
val_fc$Summary_DNase <- data.frame(thresholds) %>%
apply(FUN = getFC, MARGIN = 1, val_gtex[[2]])
val_fc$Summary_ChromHMM <- data.frame(thresholds) %>%
apply(FUN = getFC, MARGIN = 1, val_gtex[[3]])
val_fc$Binary_H3K4me3 <- data.frame(thresholds) %>%
apply(FUN = getFC, MARGIN = 1, val_gtex[[4]])
val_fc$Binary_DNase <- data.frame(thresholds) %>%
apply(FUN = getFC, MARGIN = 1, val_gtex[[5]])
val_fc$Binary_ChromHMM <- data.frame(thresholds) %>%
apply(FUN = getFC, MARGIN = 1, val_gtex[[6]])
save(val_fc, file = "computations/validation_fc.RData")
val_fc_tidy <- val_fc %>%
gather(strategy, median_logFC, 2:length(.)) %>%
separate(strategy, into = c("Method", "Input"), sep = "_") %>%
mutate(Input = recode(.x = Input, DNase = "DNase I")) %>%
makeLabelSimple() %T>%
write_tsv("tables/validation_foldchange.tsv")
Figure 1d
val_fc_tidy %>%
filter(Input != "DNase I") %>%
ggplot(aes(x = thresholds, y = median_logFC)) +
stat_smooth(aes(colour = Label, linetype = Label), se = FALSE) +
scale_colour_manual(name = "Strategy", values = pal_strategy) +
scale_linetype_manual(name = "Strategy", values = linetype_strategy) +
ggmin::theme_min() +
ylab("Mean log2 fold change of expression") + xlab("Threshold") +
theme(legend.position = "bottom") +
labs(colour = "") + labs(linetype = "") +
guides(colour = guide_legend(ncol = 1))
## `geom_smooth()` using method = 'loess'
Figure S11
val_fc_tidy %>%
ggplot(aes(x = thresholds, y = median_logFC)) +
stat_smooth(aes(colour = Label, linetype = Label), se = FALSE) +
scale_colour_manual(name = "Strategy", values = pal_strategy) +
scale_linetype_manual(name = "Strategy", values = linetype_strategy) +
ggmin::theme_min() +
# ggtitle("Median Log FC of expression across candidate genes") +
ylab("Mean log2 fold change of expression") + xlab("Threshold") +
theme(legend.position = "bottom") +
labs(colour = "") + labs(linetype = "") +
guides(colour = guide_legend(ncol = 1))
## `geom_smooth()` using method = 'loess'
sessionInfo()
## R version 3.4.1 (2017-06-30)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: CentOS Linux 7 (Core)
##
## Matrix products: default
## BLAS: /mnt/RICHARDS_JBOD1/KLEINMAN_SCRATCH/modules/software/R/3.4.1/lib64/R/lib/libRblas.so
## LAPACK: /mnt/RICHARDS_JBOD1/KLEINMAN_SCRATCH/modules/software/R/3.4.1/lib64/R/lib/libRlapack.so
##
## locale:
## [1] C
##
## attached base packages:
## [1] parallel stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] bindrcpp_0.2
## [2] chromswitch_0.99.0
## [3] stringi_1.1.6
## [4] forcats_0.2.0
## [5] stringr_1.2.0
## [6] dplyr_0.7.4
## [7] readr_1.1.1
## [8] tidyr_0.7.2
## [9] tibble_1.3.4
## [10] tidyverse_1.2.1
## [11] purrr_0.2.4
## [12] plyr_1.8.4
## [13] magrittr_1.5
## [14] BiocParallel_1.10.1
## [15] xtable_1.8-2
## [16] pROC_1.10.0
## [17] biomaRt_2.32.1
## [18] Homo.sapiens_1.3.1
## [19] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2
## [20] org.Hs.eg.db_3.4.1
## [21] GO.db_3.4.1
## [22] OrganismDbi_1.18.1
## [23] GenomicFeatures_1.28.5
## [24] AnnotationDbi_1.38.2
## [25] Biobase_2.36.2
## [26] rtracklayer_1.36.6
## [27] GenomicRanges_1.28.6
## [28] GenomeInfoDb_1.12.3
## [29] IRanges_2.10.3
## [30] S4Vectors_0.14.4
## [31] BiocGenerics_0.22.1
## [32] ggmin_0.0.0.9000
## [33] ggplot2_2.2.1
## [34] RColorBrewer_1.1-2
##
## loaded via a namespace (and not attached):
## [1] nlme_3.1-131 bitops_1.0-6
## [3] matrixStats_0.52.2 lubridate_1.7.1
## [5] devtools_1.13.4 bit64_0.9-7
## [7] httr_1.3.1 rprojroot_1.2
## [9] tools_3.4.1 backports_1.1.1
## [11] R6_2.2.2 DBI_0.7
## [13] lazyeval_0.2.1 colorspace_1.3-2
## [15] withr_2.1.0 mnormt_1.5-5
## [17] bit_1.1-12 curl_3.0
## [19] compiler_3.4.1 git2r_0.19.0
## [21] cli_1.0.0 rvest_0.3.2
## [23] graph_1.54.0 xml2_1.1.1
## [25] DelayedArray_0.2.7 labeling_0.3
## [27] scales_0.5.0 psych_1.7.8
## [29] RBGL_1.52.0 digest_0.6.12
## [31] Rsamtools_1.28.0 foreign_0.8-69
## [33] rmarkdown_1.8 XVector_0.16.0
## [35] pkgconfig_2.0.1 htmltools_0.3.6
## [37] readxl_1.0.0 rlang_0.1.4
## [39] rstudioapi_0.7 RSQLite_2.0
## [41] BiocInstaller_1.26.1 bindr_0.1
## [43] jsonlite_1.5 RCurl_1.95-4.8
## [45] GenomeInfoDbData_0.99.0 Matrix_1.2-12
## [47] Rcpp_0.12.13 munsell_0.4.3
## [49] yaml_2.1.15 SummarizedExperiment_1.6.5
## [51] zlibbioc_1.22.0 grid_3.4.1
## [53] blob_1.1.0 crayon_1.3.4
## [55] lattice_0.20-35 Biostrings_2.44.2
## [57] haven_1.1.0 hms_0.4.0
## [59] knitr_1.17 codetools_0.2-15
## [61] reshape2_1.4.2 XML_3.98-1.9
## [63] glue_1.2.0 evaluate_0.10.1
## [65] modelr_0.1.1 cellranger_1.1.0
## [67] gtable_0.2.0 assertthat_0.2.0
## [69] broom_0.4.3 GenomicAlignments_1.12.2
## [71] memoise_1.1.0 cluster_2.0.6