1. Peak Calling

The peak_calling() function identifies enriched genomic signal blocks (peaks) directly from one or more bedGraph files by scanning per-chromosome read coverage, merging contiguous non-zero segments into blocks, estimating local background using proportional flanking windows, and applying an enrichment test with BH correction.

What this function does:

This function reads aligned sequencing reads coverage tracks from bedGraph files and processes each file independently. For each chromosome, it extracts contiguous non-zero coverage intervals and merges adjacent intervals into larger “signal blocks” when the current interval starts exactly at the previous interval end.

For every block, the function calculates a block-level signal AUC (sum of coverage over the block) and estimates a local background AUC by extending upstream and downstream flanks with a length proportional to the block size (4.5× the block length on each side; bounded by chromosome edges). The background region includes the block itself plus both flanks, so the enrichment test is performed by comparing the block AUC against the total AUC in the expanded local neighborhood.

Before statistical testing, an optional AUC pre-filter can be applied via auc_top_pct: only blocks whose AUC falls within the specified top fraction are retained as candidate peaks, reducing noise from low-signal regions and computational overhead in downstream steps.

Enrichment significance is evaluated using a binomial tail probability, where the total “trials” correspond to the local background AUC and the success probability is the length fraction (block_length / bg_length). The resulting p-values are adjusted with BH to obtain q-values. Peaks are filtered by both statistical significance (qvalue_cutoff) and fold change (fc_cutoff), then exported as narrowPeak format (.narrowPeak) with MACS2-compatible columns (signalValue, pValue, qValue, peak), compatible with rtracklayer::import() and standard peak analysis tools.

Parameters

Parameter Type Default Description Example
bedgraph_path character vector Vector of bedGraph file paths. Each bedGraph is processed independently (file-level parallelism), and the function returns a named list of output BED file paths. bedgraph_path = c("C1_H3K27ac.bedGraph", "C1_H3K4me3.bedGraph")
out_dir character "./" Output directory to save peak BED files. Created recursively if it does not exist. out_dir = "./peaks"
ref_genome character "hg38" Reference genome build used to define standard chromosomes and chromosome lengths. Supported values: "hg38", "mm10". Chromosome naming style (UCSC vs NCBI) is auto-detected from the first bedGraph (presence/absence of chr prefix) and applied to the genome object. ref_genome = "mm10"
auc_top_pct numeric 0.1 Fraction of blocks to retain based on AUC ranking prior to statistical testing. Only blocks whose AUC falls within the top fraction are considered as candidate peaks; e.g., The default value 0.1 retains only the top 10% of blocks by AUC. auc_top_pct = 0.1
qvalue_cutoff numeric 0.05 BH-adjusted q-value cutoff for peak significance filtering. qvalue_cutoff = 0.01
fc_cutoff numeric 2 Fold-change cutoff for peak filtering. Fold change is computed as mean signal in the block divided by mean signal in the local background neighborhood: (auc / block_length) / (bg_auc / bg_length). fc_cutoff = 3

Output Files

The function generates the following output file(s) for each input BAM in the specified out_dir:

  1. Peak BED file - <pair>_peaks.narrowPeak
    • <pair> refers to the BAM filename without extension
    • The output is a BED file with a header, compatible with common peak-processing tools, and contains the following columns:
      • chrom: Chromosome name
      • chromStart: 0-based start coordinate
      • chromEnd: 1-based end coordinate (BED end; exclusive semantics are not enforced and correspond to the block end)
      • name: Peak identifier in the format chr:start-end
      • score: Integer score derived from -log10(q_value) (scaled and capped at 1000)
      • strand: .
      • signalValue: Enrichment fold change (fc)
      • pValue: -log10(p_value) (protected from underflow)
      • qValue: -log10(q_value) (protected from underflow)
chrom chromStart chromEnd name score strand signalValue pValue qValue peak
1 chr1 180522 180985 chr1:180522-180985 1000 . 10.0021598272138 307.652655568589 307.652655568589 -1
2 chr1 191187 191864 chr1:191187-191864 1000 . 10.0014771048744 307.652655568589 307.652655568589 -1
3 chr1 268808 269473 chr1:268808-269473 1000 . 7.30879120879121 307.652655568589 307.652655568589 -1
4 chr1 778367 779290 chr1:778367-779290 1000 . 10.0010834236186 307.652655568589 307.652655568589 -1

Example BED output showing peak-calling results for the H3K4me1-H3K4me3 pair in sample C1 from the test dataset.

Example Usage

library(multiEpiCore)

# ===== Test Data =======
bg_dirs <- list.dirs("bedgraph", full.names = TRUE, recursive = FALSE)
out_dir <- "./peak"

for (bg_dir in bg_dirs) {
  sample <- basename(bg_dir)
  sample_out <- file.path(out_dir, sample)
  bg_paths <- list.files(path = bg_dir, pattern = "\\.bedGraph$", recursive = FALSE, full.names = TRUE)
  peak_calling(
    bedgraph_path = bg_paths,
    out_dir = sample_out
  )
}

2. Peak Annotation

2A. Peak-level Pathway Annotation Pipeline

The peak_pathway_annotation() function provides an automated workflow for performing pathway enrichment annotation on peak BED files across multiple targets. It converts each BED file into a GRangesList, runs rGREAT-based pathway enrichment against MSigDB Hallmark gene sets, saves per-target enrichment tables, and optionally generates a combined bubble plot summary.

Parameters

Parameter Type Default Description Example
peak_path character Vector of BED file paths containing peak regions. Each BED file is treated as one target, and the target name is inferred from the file basename after removing pattern (e.g., stripping _peaks.narrowPeak). The BED files must contain at least three columns: chromosome, start, and end. peak_path = c("CTCF_peaks.narrowPeak", "H3K27ac_peaks.narrowPeak")
out_dir character "./" Output directory where pathway enrichment results will be saved. The directory is created automatically if it does not exist. Per-target TSV tables and the optional bubble plot PDF are written here. out_dir = "./peak/sampleA"
ref_genome character "hg38" Reference genome used to select the TSS annotation database for rGREAT. Supported options are "hg38" and "mm10". ref_genome = "mm10"
msigdb_collection character "H" MSigDB collection abbreviation. Common values: "H" (Hallmark, 50 curated biological states), "C2" (curated gene sets including KEGG and Reactome), "C5" (GO gene sets). See msigdbr::msigdbr_collections() for the full list. "C2"
pattern character "_peaks.\\narrowPeak$" Regular expression used to remove suffix text from each BED filename when constructing target names. This must match the naming convention of your peak files; otherwise, target names may be incorrect or non-unique. pattern = "\\.narrowPeak$"
plot logical TRUE Whether to generate a combined bubble plot summarizing pathway enrichment across all targets. If FALSE, only per-target TSV tables are generated. plot = FALSE

Output Files

All output files are saved to the specified out_dir:

  1. Per-target pathway enrichment table - pathway_annotation_<target>.tsv
  • rGREAT enrichment results for each target peak set (one BED file)
  • Contains the following columns:
    • pathway: Name of the enriched MSigDB Hallmark gene set (with the “HALLMARK_” prefix removed).
    • hits_region: Number of input peak regions assigned to genes belonging to this pathway.
    • fold: Fold enrichment of observed region-gene associations compared to the genomic background expectation.
    • p: Raw p-value from the enrichment test assessing whether the association is statistically significant.
    • padj: Multiple-testing adjusted p-value (FDR) correcting for testing across many pathways.
    • hits_gene: Number of unique genes in this pathway that are linked to at least one input peak.
pathway hits_region fold p padj hits_gene
HEME_METABOLISM 751 2.34242482729004 0 0 158
IL6_JAK_STAT3_SIGNALING 279 2.1539487647391 0 0 49
P53_PATHWAY 667 1.96477997521663 0 0 143
UNFOLDED_PROTEIN_RESPONSE 314 1.88450359297116 0 0 75
TGF_BETA_SIGNALING 267 1.78440015179855 0 0 42
DNA_REPAIR 278 1.71262469023883 0 0 90
ESTROGEN_RESPONSE_EARLY 828 1.69192498116218 0 0 154
ESTROGEN_RESPONSE_LATE 665 1.66254073916161 0 0 151
ADIPOGENESIS 530 1.64575217378315 0 0 145
HYPOXIA 734 1.61320155249418 0 0 140
TNFA_SIGNALING_VIA_NFKB 792 1.58554145852615 0 0 152
XENOBIOTIC_METABOLISM 450 1.56279143172028 0 0 127
UV_RESPONSE_UP 443 1.54357603879619 0 0 107
IL2_STAT5_SIGNALING 712 1.53461386388768 0 0 139
MTORC1_SIGNALING 505 1.49494285184175 0 0 123
APICAL_JUNCTION 551 1.463680066491 0 0 139
APOPTOSIS 475 1.48104653987793 3.33066907387547e-16 9.79608551139844e-16 110
MITOTIC_SPINDLE 529 1.36376400697203 4.01345623401994e-12 1.11484895389443e-11 142
PI3K_AKT_MTOR_SIGNALING 302 1.49192024874113 3.21923598889384e-11 8.4716736549838e-11 70
MYOGENESIS 508 1.31864023785508 9.69003322026651e-10 2.42250830506663e-09 141
MYC_TARGETS_V2 119 1.82974169254377 1.20143017756646e-09 2.86054804182492e-09 36
REACTIVE_OXYGEN_SPECIES_PATHWAY 103 1.77523750310684 6.06994279284123e-08 1.37953245291846e-07 34
G2M_CHECKPOINT 513 1.26492214857928 1.23927521800127e-07 2.69407656087232e-07 136
ANDROGEN_RESPONSE 330 1.33614385753243 2.39605753793448e-07 4.9917865373635e-07 75
INTERFERON_GAMMA_RESPONSE 444 1.26193664345416 1.04498334296821e-06 2.08996668593642e-06 109
MYC_TARGETS_V1 320 1.29514823159424 4.42162588376593e-06 8.50312669954987e-06 117
OXIDATIVE_PHOSPHORYLATION 340 1.27782170145938 6.7021178460358e-06 1.24113293445107e-05 132
PEROXISOME 227 1.34919819212654 8.78569769868776e-06 1.56887458905138e-05 67
GLYCOLYSIS 493 1.21410566576122 1.33914178344074e-05 2.30886514386335e-05 139
COAGULATION 274 1.2900533590942 2.61656722224668e-05 4.36094537041113e-05 81
E2F_TARGETS 340 1.22929930562074 0.000113631696623262 0.000183276930037519 123
PROTEIN_SECRETION 243 1.21740811782978 0.00152880417435064 0.00238875652242287 58
COMPLEMENT 438 1.13378486167835 0.00491815778703741 0.00745175422278396 126
INFLAMMATORY_RESPONSE 499 1.11258543831093 0.00935157469816006 0.0137523157325883 123
INTERFERON_ALPHA_RESPONSE 158 1.18824427654253 0.0184387753694073 0.0263411076705819 49
APICAL_SURFACE 142 1.19709034950041 0.0197614340616903 0.0274464361967921 31
NOTCH_SIGNALING 99 1.23636420287506 0.0222803736236655 0.0301086130049534 20
ALLOGRAFT_REJECTION 359 1.09140387222255 0.0517507656437572 0.0680931126891543 101
KRAS_SIGNALING_DN 455 1.07232300390841 0.0706418972348002 0.0905665349164105 117
CHOLESTEROL_HOMEOSTASIS 155 1.12700521636616 0.0753616524855787 0.0942020656069734 47
WNT_BETA_CATENIN_SIGNALING 140 1.1286634877976 0.0839096466534721 0.102328837382283 30
FATTY_ACID_METABOLISM 279 1.08475876599981 0.0920066254830114 0.109531697003585 91
ANGIOGENESIS 104 1.12332164952526 0.128663088630856 0.149608242594018 18
HEDGEHOG_SIGNALING 145 0.969022723736511 0.659136328916236 0.749018555586632 30
SPERMATOGENESIS 258 0.90177989643402 0.957419784368706 1 79
UV_RESPONSE_DN 554 0.924360737401476 0.972326175648468 1 95
BILE_ACID_METABOLISM 188 0.853630810616856 0.988205830454095 1 61
KRAS_SIGNALING_UP 508 0.870884862971025 0.999415085430068 1 119
PANCREAS_BETA_CELLS 81 0.685957113608895 0.999876069754364 1 26
EPITHELIAL_MESENCHYMAL_TRANSITION 442 0.697646781527174 1 1 112

Note: This table shows the pathway enrichment annotation results for peaks from H3K4me1-H3K4me1 in the C1 sample.

  1. Cross-target bubble plot - pathway_annotation.pdf (when plot = TRUE)
  • A combined summary visualization across all targets and pathways
  • x-axis: target (BED-derived name)
  • y-axis: pathway
  • point color: capped -log10(padj) for visual stability
  • point size: log2(1 + fold_enrichment)

Example Usage

library(multiEpiCore)

# ===== Test Data =======
pk_dirs <- list.dirs("peak", full.names = TRUE, recursive = FALSE)

for (pk_dir in pk_dirs) {
  sample <- basename(pk_dir)
  sample_out <- file.path(pk_dir, "pathway_annotation")
  peak_paths <- list.files(path = pk_dir, pattern = "\\.narrowPeak$", recursive = FALSE, full.names = TRUE)
  peak_pathway_annotation(
    peak_path = peak_paths,
    out_dir = sample_out
  )
}

2B. Peak-level Genomic Distribution Pipeline

The peak_genomic_distribution() function performs genomic feature annotation for peak BED files and quantifies their distribution across regulatory element categories. Each peak set is converted into genomic ranges and annotated against selected reference resources such as gene features, cCRE elements, ChromHMM chromatin states, and repeat elements, followed by optional visualization of distribution patterns.

Parameters

Parameter Type Default Description Example
bed_path character Path(s) to input BED files, or a directory containing BED files. Each BED file is treated as one target, and target names are extracted from filenames using pattern. c("CTCF_peaks.narrowPeak", "H3K27ac_peaks.narrowPeak")
out_dir character "./" Output directory for annotation results. Results for each annotation type and target are saved here. "./distribution_results"
pattern character "_peaks\\.narrowPeak$" Regular expression used to remove suffix text from each BED filename when constructing target names. This must match the naming convention of your peak files; otherwise, target names may be incorrect or non-unique. pattern = "\\.narrowPeak$"
distributions character vector c("genic", "ccre") Annotation types to compute. Supported options are "genic", "ccre", "chromhmm", and "repeat". At least one annotation type must be specified. c("genic", "ccre", "repeat")
ref_genome character "hg38" Reference genome version used for annotation. Supported options are "hg38" and "mm10". "mm10"
ref_source character "knownGene" Gene annotation source for genic-based annotation. Supported options are "knownGene" (UCSC) and "GENCODE". "GENCODE"
mode character "nearest" Method used to assign peaks to genomic features. "nearest" assigns each peak to the closest feature, while "weighted" assigns annotations proportionally based on overlap length. "weighted"
plot logical TRUE Whether to generate visualization plots summarizing genomic distribution across annotation categories. FALSE
  • Annotation types available:
    • "genic": Gene structural features - Promoter, 5’ UTR, Exon, Intron, 3’ UTR
    • "ccre": Candidate cis-Regulatory Elements - dELS, pELS, PLS, CA-H3K4me3, CA-CTCF, CA-TF, TF, CA
    • "chromhmm": Chromatin states - Acet, EnhWk, EnhA, PromF, TSS, TxWk, TxEx, Tx, OpenC, TxEnh, ReprPCopenC, BivProm, ZNF, ReprPC, HET, GapArtf, Quies
    • "repeat": Repetitive elements - SINE, LINE, LTR, Retroposon, RC, DNA, Satellite, Simple_repeat, Low_complexity, rRNA, tRNA, snRNA, scRNA, srpRNA, RNA, Unknown
    • For detailed description of each annotation category, see the Annotation page
  • Annotation assignment methods:
    • "nearest": Each genomic region is assigned to its closest feature (by distance to feature center)
    • "weighted": Each region is proportionally assigned to overlapping features based on overlap length
    • nearest mode is faster and simpler; weighted mode provides more accurate representation for regions spanning multiple features

Output Files

The function generates the following output files in the specified out_dir: 1. Composition tables - {genic/ccre/chromhmm/repeat}_distribution.tsv - .tsv tables summarizing the percentage composition of various annotations - Row: cluster lables - Col: annotations states - Values represent the proportion (%) of genomic regions assigned to each state within a cluster

  • Genic distribution
Promoter 5’ UTR Exon Intron 3’ UTR other
H3K27me3-H3K4me1 13.55 1.39 7.16 46.07 2.37 29.46
H3K27me3-H3K4me3 25.88 3.98 9.22 38.76 2.14 20.02
H3K27me3-H3K9me2 8.96 0.96 5.57 49.88 2.05 32.58
H3K27me3-H3K9me3 10.61 1.04 6.17 47.40 2.06 32.72
……
H3K9me3-H3K9me3 6.79 0.70 4.30 45.85 1.41 40.95
  • cCRE distribution
dELS pELS PLS CA-H3K4me3
H3K27me3-H3K4me1 50.47 13.57 8.13 2.40
H3K27me3-H3K4me3 35.62 21.39 21.62 2.01
H3K27me3-H3K9me2 39.89 9.36 4.06 2.66
H3K27me3-H3K9me3 40.86 9.89 5.74 3.17
……
H3K9me3-H3K9me3 26.82 5.78 3.51 4.54
CA-CTCF CA-TF TF CA other
H3K27me3-H3K4me1 2.43 0.86 2.18 4.41 15.55
H3K27me3-H3K4me3 1.50 0.47 1.62 2.99 12.79
H3K27me3-H3K9me2 2.79 0.78 2.58 6.06 31.82
H3K27me3-H3K9me3 2.82 0.82 2.64 5.86 28.21
……
H3K9me3-H3K9me3 2.98 0.77 3.02 6.78 45.79

Note: These tables show the genomic distribution results for peaks from H3K4me1-H3K4me1 in the C1 sample.

  1. Cmposition bar plot - {genic/ccre/chromhmm/repeat}_distribution.pdf
    • Stacked horizontal bar plot showing ChromHMM chromatin state distribution across clusters
    • X-axis: Percentage (0-100%)
    • Y-axis: Cluster labels (top to bottom)

Note: These figurs show the genomic distribution results for peaks from H3K4me1-H3K4me1 in the C1 sample.

Example Usage

library(multiEpiCore)
# Test Data
pk_dirs <- list.dirs("peak", full.names = TRUE, recursive = FALSE)
for (pk_dir in pk_dirs) {
  sample <- basename(pk_dir)
  sample_out <- file.path(pk_dir, "genomic_distribution")
  peak_paths <- list.files(path = pk_dir, pattern = "\\.narrowPeak$", recursive = FALSE, full.names = TRUE)
  peak_genomic_distribution(
    peak_path = peak_paths,
    out_dir = sample_out
  )
}

3. Peak Differential Analysis

Wrapper Function

Unlike other steps in this package where each function operates across all CRF pairs simultaneously, the differential analysis here is structured per CRF pair across samples. This reflects the biological reality that each CRF pair targets a distinct genomic locus set — peak regions from different CRF pairs are not directly comparable and should never be pooled into a shared count matrix.

peak_differential_analysis() is the recommended entry point for this step. It wraps build_peak_set() and differential_regions_single_peak() into a single call: for each CRF pair found consistently across all sample directories, it first constructs a condition-aware consensus peak set, then counts fragment overlap per sample and runs the voom-limma differential analysis.

### Parameters
Parameter Type Default Description Example
peak_dirs character vector One directory per sample containing narrowPeak files. Must be in the same order as conditions. Each directory is scanned for files matching peak_pattern; the pair name is extracted by stripping the pattern from the filename. peak_dirs = c("./peak/C1", "./peak/C2", "./peak/T1", "./peak/T2")
bam_dirs character vector One directory per sample containing BAM files, in the same order as peak_dirs and conditions. Pair names are extracted by stripping bam_pattern from the filename; only pairs present across all BAM and peak directories are processed. bam_dirs = c("./bam/C1", "./bam/C2", "./bam/T1", "./bam/T2")
conditions character vector Condition label for each sample, in the same order as peak_dirs and bam_dirs. Each condition must have at least 2 replicates. conditions = c("C", "C", "T", "T")
sample_names character vector or NULL NULL Optional display names for each sample, in the same order as conditions. If NULL, names are auto-generated as {condition}{replicate_index}. sample_names = c("C1", "C2", "T1", "T2")
ref_genome character "hg38" Reference genome used to obtain chromosome sizes for boundary checking during tiling. Must be one of "hg38" or "mm10". Ignored when window_size is NULL. ref_genome = "mm10"
out_dir character "./" Root output directory. Consensus BED files are written to {out_dir}/peak_sets/ and differential results to {out_dir}/differential/. out_dir = "./results"
bam_pattern character "\\.bam$" Regular expression used to remove suffix text from each BAM filename when constructing target names. This must match the naming convention of your BAM files and produce the same target names as peak_pattern; otherwise, pair matching across peak and BAM directories will fail. bam_pattern = "\\.bam$"
peak_pattern character "_peaks\\.narrowPeak$" Regex pattern used to scan peak_dirs and to strip the suffix when extracting pair names from filenames. peak_pattern = "_peaks\\.narrowPeak$"
window_size integer or NULL NULL Passed to build_peak_set(). If NULL, master regions are used directly. If a positive integer, regions are tiled into fixed-size windows. window_size = 500L
min_support integer 2 Passed to both build_peak_set() and differential_regions_single_peak(). Minimum number of samples within at least one condition group required to retain a region at the peak-building stage, and to pass the non-zero filter at the testing stage. min_support = 2
lfc_threshold numeric 0.5 Minimum absolute log2 fold-change for significance. lfc_threshold = 0.5
fdr_threshold numeric 0.05 Adjusted p-value (BH) threshold for significance. fdr_threshold = 0.05

Output Structure

{out_dir}/
├── peak_sets/
│   ├── {pair}.bed          # consensus peak set per CRF pair
│   └── ...
└── differential/
    ├── {test}_vs_{ref}_{pair}_all.tsv
    ├── {test}_vs_{ref}_{pair}_sig.tsv
    ├── {test}_vs_{ref}_summary.tsv
    ├── {test}_vs_{ref}_summary.pdf
    └── ...

Example Usage

peak_differential_analysis(
  peak_dirs  = c("./peak/C1", "./peak/C2", "./peak/T1", "./peak/T2"),
  bam_dirs   = c("./bam/C1",  "./bam/C2",  "./bam/T1",  "./bam/T2"),
  conditions = c("C", "C", "T", "T"),
  sample_names = c("C1", "C2", "T1", "T2"),
  out_dir    = "./peak",
  min_support  = 2,
  window_size  = NULL,
  lfc_threshold = 0.5,
  fdr_threshold = 0.05
)

3A. Build Peak Set

Given a set of narrowPeak files from the same target(CRF pair) across multiple samples, build_peak_set() function constructs a unified peak set that represents regions consistently enriched across the cohort.

What this function does:

First, all per-sample peak intervals are pooled together. The genome is then atomized into the smallest non-overlapping sub-intervals defined by all peak boundaries, and each sub-interval is scored by how many distinct samples contain it. Sub-intervals supported by fewer than min_support samples are discarded as sample-specific noise; the remaining sub-intervals are merged into contiguous consensus blocks wherever they are adjacent.

  • If window_size is NULL, these master regions are exported directly as the final peak set (each region retains its natural variable width).
  • If window_size is specified, each master region is divided into fixed-size windows tiled outward from the region’s center. The number of windows is ceil(region_width / window_size). For an odd number of windows, the center window is anchored at the region midpoint; for an even number, the midpoint falls at the boundary between the two central windows. Windows that extend beyond chromosome boundaries are discarded.

Parameters

Parameter Type Default Description Example
peak_path character vector Vector of narrowPeak or BED file paths, one per sample. Each file must contain at least three columns: chrom, chromStart, chromEnd. peak_path = c(“S1_peaks.narrowPeak”, “S2_peaks.narrowPeak”)
pair character Output file basename (without extension). The final BED file is written as <pair>.bed inside out_dir. pair = “H3K27ac_consensus”
conditions character vector Condition label for each sample, in the same order as peak_path. Used together with min_support to apply a per-group reproducibility filter. conditions = c(“ctrl”, “ctrl”, “treat”, “treat”)
ref_genome character “hg38” Reference genome used to obtain chromosome sizes for boundary checking during tiling. Must be one of “hg38” or “mm10”. Ignored when window_size is NULL. ref_genome = “mm10”
out_dir character “./” Output directory. Created recursively if it does not exist. out_dir = “./peak_sets”
window_size integer or NULL NULL If NULL, master regions are exported directly. If a positive integer, each master region is tiled into fixed-size windows of this width. Regions narrower than window_size are skipped. window_size = 500L
min_support integer 2 Minimum number of distinct samples within a single condition group required to retain an atomic sub-interval. A sub-interval is kept if at least one condition group meets this threshold, so condition-specific but reproducible peaks are not discarded. min_support = 3

Output Files

  1. Consensus BED file - {pair}.bed
  • Headerless, tab-separated, 0-based half-open coordinates
  • Three columns: chrom, chromStart, chromEnd

Example Usage

library(multiEpiCore)
library(stringr)

# =============== Test data ===============
peak_dirs <- c("./peak/C1", "./peak/C2", "./peak/T1", "./peak/T2")
conditions <- c("C", "C", "T", "T")
pattern   <- "_peaks\\.narrowPeak$"
out_dir   <- "./peak/peak_sets"

# collect peak files and pair names per directory
dir_peaks <- lapply(peak_dirs, function(d) {
  paths <- list.files(d, pattern = pattern, full.names = TRUE)
  pairs <- basename(paths) |> str_replace(pattern, "")
  setNames(paths, pairs)
})

# retain only pair names present in ALL directories
common_pairs <- Reduce(intersect, lapply(dir_peaks, names))
message(length(common_pairs), " pairs found across all samples.")

# build peak set for each common pair
peak_set_paths <- lapply(common_pairs, function(pair) {
  peak_path <- sapply(dir_peaks, function(x) x[[pair]])
  build_peak_set(
    peak_path   = peak_path,
    conditions  = conditions,
    pair        = pair,
    out_dir     = out_dir,
    window_size = NULL,
    min_support = 2
  )
})
names(peak_set_paths) <- common_pairs

To restrict the peak set to specific genomic contexts, the output BED file can be further filtered or trimmed using bedtools intersect:

# Filter peak_set to regions overlapping a set of interest
# retaining the original peak_set intervals
bedtools intersect \
  -a peak_set.bed \
  -b regions_of_interest.bed \
  -u \
> peak_set.overlap_only.bed

# Restrict peak_set to only the bases shared with regions_of_interest,
# collapsing all overlapping region entries into a single merged interval per peak (-u).
bedtools intersect \
  -a peak_set.bed \
  -b regions_of_interest.bed \
  -u \
| bedtools merge \
> peak_set.overlap_trimmed.bed

# Trim each peak_set interval to the exact overlapping segments with regions_of_interest
# splitting a single peak into multiple shorter intervals if it overlaps multiple regions.
bedtools intersect \
  -a peak_set.bed \
  -b regions_of_interest.bed \
> peak_set.overlap_segments.bed

3B. Differential Region Analysis

Given a consensus peak set (e.g. produced by build_peak_set()) and a set of BAM files from the same target (CRF pair) across multiple samples, differential_regions_single_peak() counts fragment overlap per region per sample, then identifies differentially accessible regions (DARs) between conditions using a two-round voom-limma framework.

What this function does:

For each BAM file, paired-end fragments are reconstructed from read pairs and their overlap with each consensus region is quantified as a proportional count. The resulting region × sample count matrix is passed to a two-round voom + limma pipeline: a first voom pass is used to filter low-expression regions by mean expression quantile, and a second voom pass fits the final linear model. Empirical Bayes shrinkage (eBayes) is applied and all pairwise condition comparisons are tested. Results are written to disk as full result tables, significant DAR tables, and per-comparison summary files.

Parameters

Parameter Type Default Description Example
bam_path character vector Vector of BAM file paths, one per sample, all from the same CRF pair. Must be in the same order as conditions and sample_names. Each BAM must be coordinate-sorted and indexed. bam_path = c(“C1/CRF1.bam”, “C2/CRF1.bam”, “T1/CRF1.bam”, “T2/CRF1.bam”)
conditions character vector Condition label for each sample, in the same order as bam_path. Each condition must have at least 2 replicates. conditions = c(“C”, “C”, “T”, “T”)
regions character Path to a BED file defining the consensus regions to test. Typically the output of build_peak_set(). Must be a headerless, tab-separated file with columns chrom, chromStart (0-based), chromEnd. regions = “./peak_sets/CRF1.bed”
pair character Name of the CRF pair being analyzed. Used as a prefix in all output filenames. pair = “CRF1”
sample_names character vector or NULL NULL Optional display names for each sample, in the same order as bam_path. If NULL, names are auto-generated as {condition}{replicate_index}. sample_names = c(“C1”, “C2”, “T1”, “T2”)
out_dir character “./” Output directory. Created recursively if it does not exist. out_dir = “./differential/CRF1”
min_support integer 2 Minimum number of samples with non-zero counts within at least one condition group for a region to be tested. min_support = 2
mean_quantile numeric 0 Regions with mean log2 expression below this quantile (from the first voom pass) are removed before the second voom fit. mean_quantile = 0.1
lfc_threshold numeric 0.5 Minimum absolute log2 fold-change for a region to be called significant. lfc_threshold = 0.5
fdr_threshold numeric 0.05 Adjusted p-value (BH) threshold for significance. fdr_threshold = 0.05

Output Files

All files are written to out_dir, organized by comparison tag (in the form {test}_vs_{ref}) read from the conditions, with pair or cluster name as a subdirectory prefix.

  1. Full result table<comparison>/<pair>_all.tsv
    • All tested regions with columns: pos, logFC, AveExpr, t, P.Value, adj.P.Val, B
pos logFC AveExpr t P.Value adj.P.Val B
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
chr1_161083373_161084202 3.9289 6.4669 10.5248 9.49e-12 1.41e-07 16.7132
chr1_205229920_205230422 -4.0230 5.6295 -9.2274 2.14e-10 1.59e-06 13.6221
chr19_51417328_51417981 3.3961 5.8269 8.7129 7.83e-10 3.87e-06 12.4792
chr22_19447971_19448534 -3.8683 5.5171 -8.4705 1.46e-09 5.41e-06 11.7937
chr20_51360094_51360812 3.2724 5.7498 8.0526 4.35e-09 1.11e-05 10.8309

Note: This table shows differential analysis results for H3K4me1-H3K4me1 in the test dataset.

  1. Significant DAR table<comparison>/<pair>_sig.tsv
    • Subset of regions passing both fdr_threshold and lfc_threshold
  2. Summary table<comparison>/summary.tsv
    • One row per pair with counts of tested regions, significant DARs, and up/down split
group n_rows_before n_rows_after_nonzero n_rows_after_mean n_sig n_up n_down
<chr> <int> <int> <int> <int> <int> <int>
H3K9me2-H3K9me3 624 613 613 89 39 50
H3K4me1-H3K9me3 7406 7368 7368 0 0 0
H3K9me3-H3K9me3 17495 17417 17417 1408 614 794
H3K27me3-H3K9me3 18828 18751 18751 1358 653 705
H3K4me3-H3K4me3 15631 15630 15630 668 443 225
H3K4me1-H3K4me1 14844 14836 14836 2235 1166 1069
H3K27me3-H3K4me1 39047 39046 39046 2179 1058 1121
H3K4me3-H3K9me3 12988 12982 12982 31 17 14
H3K4me1-H3K4me3 15595 15594 15594 250 165 85
H3K27me3-H3K4me3 17133 17124 17124 59 46 13
H3K27me3-H3K9me2 728 710 710 140 32 108

Note: Rows are appended across calls sharing the same out_dir, so results from multiple pairs within the same comparison are accumulated into a single file.

Example Usage

library(multiEpiCore)
library(stringr)

# =============== Test data ===============

bam_dirs   <- c("./bam/C1", "./bam/C2", "./bam/T1", "./bam/T2")
conditions <- c("C", "C", "T", "T")
pattern    <- "\\.bam$"
peak_set_dir <- "./peak/peak_sets"   # output of build_peak_set()
out_dir      <- "./peak/differential"

# collect BAM files and pair names per directory
dir_bams <- lapply(bam_dirs, function(d) {
  paths <- list.files(d, pattern = pattern, full.names = TRUE)
  setNames(paths, basename(paths) |> str_replace(pattern, ""))
})

# retain only pairs present in ALL directories
common_pairs <- Reduce(intersect, lapply(dir_bams, names))
message(length(common_pairs), " pairs found across all samples.")
if (length(common_pairs) == 0)
  stop("No common pairs found. Check BAM filenames and directories.")

# run differential analysis for each common pair
results <- lapply(common_pairs, function(pair) {
  bam_path <- sapply(dir_bams, function(x) x[[pair]])
  differential_regions_single_peak(
    bam_path      = bam_path,
    conditions    = conditions,
    regions       = file.path(peak_set_dir, paste0(pair, ".bed")),
    pair          = pair,
    out_dir       = out_dir,
    min_support   = 2,
    mean_quantile = 0,
    lfc_threshold = 0.5,
    fdr_threshold = 0.05
  )
})
names(results) <- common_pairs