The peak_calling() function identifies enriched genomic
signal blocks (peaks) directly from one or more bedGraph files by
scanning per-chromosome read coverage, merging contiguous non-zero
segments into blocks, estimating local background using proportional
flanking windows, and applying an enrichment test with BH
correction.
What this function does:
This function reads aligned sequencing reads coverage tracks from bedGraph files and processes each file independently. For each chromosome, it extracts contiguous non-zero coverage intervals and merges adjacent intervals into larger “signal blocks” when the current interval starts exactly at the previous interval end.
For every block, the function calculates a block-level signal AUC (sum of coverage over the block) and estimates a local background AUC by extending upstream and downstream flanks with a length proportional to the block size (4.5× the block length on each side; bounded by chromosome edges). The background region includes the block itself plus both flanks, so the enrichment test is performed by comparing the block AUC against the total AUC in the expanded local neighborhood.
Before statistical testing, an optional AUC pre-filter can be applied
via auc_top_pct: only blocks whose AUC falls within the
specified top fraction are retained as candidate peaks, reducing noise
from low-signal regions and computational overhead in downstream
steps.
Enrichment significance is evaluated using a binomial tail
probability, where the total “trials” correspond to the local background
AUC and the success probability is the length fraction
(block_length / bg_length). The resulting p-values are
adjusted with BH to obtain q-values. Peaks are filtered by both
statistical significance (qvalue_cutoff) and fold change
(fc_cutoff), then exported as narrowPeak
format (.narrowPeak) with MACS2-compatible columns
(signalValue, pValue, qValue,
peak), compatible with rtracklayer::import()
and standard peak analysis tools.
| Parameter | Type | Default | Description | Example |
|---|---|---|---|---|
bedgraph_path |
character vector | — | Vector of bedGraph file paths. Each bedGraph is processed independently (file-level parallelism), and the function returns a named list of output BED file paths. | bedgraph_path = c("C1_H3K27ac.bedGraph", "C1_H3K4me3.bedGraph") |
out_dir |
character | "./" |
Output directory to save peak BED files. Created recursively if it does not exist. | out_dir = "./peaks" |
ref_genome |
character | "hg38" |
Reference genome build used to define standard chromosomes and chromosome lengths.
Supported values: "hg38", "mm10".
Chromosome naming style (UCSC vs NCBI) is auto-detected from the first bedGraph
(presence/absence of chr prefix) and applied to the genome object.
|
ref_genome = "mm10" |
auc_top_pct |
numeric | 0.1 |
Fraction of blocks to retain based on AUC ranking prior to statistical testing. Only blocks whose AUC falls within the top fraction are considered as candidate peaks; e.g., The default value 0.1 retains only the top 10% of blocks by AUC.
|
auc_top_pct = 0.1 |
qvalue_cutoff |
numeric | 0.05 |
BH-adjusted q-value cutoff for peak significance filtering. | qvalue_cutoff = 0.01 |
fc_cutoff |
numeric | 2 |
Fold-change cutoff for peak filtering. Fold change is computed as mean signal in the block divided by mean
signal in the local background neighborhood:
(auc / block_length) / (bg_auc / bg_length).
|
fc_cutoff = 3 |
The function generates the following output file(s) for each input
BAM in the specified out_dir:
<pair>_peaks.narrowPeak
<pair> refers to the BAM filename without extension
chrom: Chromosome name
chromStart: 0-based start coordinate
chromEnd: 1-based end coordinate (BED end; exclusive
semantics are not enforced and correspond to the block end)
name: Peak identifier in the format
chr:start-end
score: Integer score derived from
-log10(q_value) (scaled and capped at 1000)
strand: .
signalValue: Enrichment fold change (fc)
pValue: -log10(p_value) (protected from
underflow)
qValue: -log10(q_value) (protected from
underflow)
| chrom | chromStart | chromEnd | name | score | strand | signalValue | pValue | qValue | peak | |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | chr1 | 180522 | 180985 | chr1:180522-180985 | 1000 | . | 10.0021598272138 | 307.652655568589 | 307.652655568589 | -1 |
| 2 | chr1 | 191187 | 191864 | chr1:191187-191864 | 1000 | . | 10.0014771048744 | 307.652655568589 | 307.652655568589 | -1 |
| 3 | chr1 | 268808 | 269473 | chr1:268808-269473 | 1000 | . | 7.30879120879121 | 307.652655568589 | 307.652655568589 | -1 |
| 4 | chr1 | 778367 | 779290 | chr1:778367-779290 | 1000 | . | 10.0010834236186 | 307.652655568589 | 307.652655568589 | -1 |
Example BED output showing peak-calling results for the
H3K4me1-H3K4me3 pair in sample C1 from the
test dataset.
library(multiEpiCore)
# ===== Test Data =======
bg_dirs <- list.dirs("bedgraph", full.names = TRUE, recursive = FALSE)
out_dir <- "./peak"
for (bg_dir in bg_dirs) {
sample <- basename(bg_dir)
sample_out <- file.path(out_dir, sample)
bg_paths <- list.files(path = bg_dir, pattern = "\\.bedGraph$", recursive = FALSE, full.names = TRUE)
peak_calling(
bedgraph_path = bg_paths,
out_dir = sample_out
)
}
The peak_pathway_annotation() function provides an
automated workflow for performing pathway enrichment annotation on peak
BED files across multiple targets. It converts each BED file into a
GRangesList, runs rGREAT-based pathway
enrichment against MSigDB Hallmark gene sets, saves per-target
enrichment tables, and optionally generates a combined bubble plot
summary.
| Parameter | Type | Default | Description | Example |
|---|---|---|---|---|
peak_path |
character | — |
Vector of BED file paths containing peak regions.
Each BED file is treated as one target, and the target name is inferred
from the file basename after removing pattern
(e.g., stripping _peaks.narrowPeak).
The BED files must contain at least three columns:
chromosome, start, and end.
|
peak_path = c("CTCF_peaks.narrowPeak", "H3K27ac_peaks.narrowPeak") |
out_dir |
character | "./" |
Output directory where pathway enrichment results will be saved. The directory is created automatically if it does not exist. Per-target TSV tables and the optional bubble plot PDF are written here. | out_dir = "./peak/sampleA" |
ref_genome |
character | "hg38" |
Reference genome used to select the TSS annotation database for rGREAT.
Supported options are "hg38" and "mm10".
|
ref_genome = "mm10" |
msigdb_collection |
character | "H" |
MSigDB collection abbreviation. Common values: "H" (Hallmark, 50 curated biological states), "C2" (curated gene sets including KEGG and Reactome), "C5" (GO gene sets). See msigdbr::msigdbr_collections() for the full list. |
"C2" |
pattern |
character | "_peaks.\\narrowPeak$" |
Regular expression used to remove suffix text from each BED filename when constructing target names. This must match the naming convention of your peak files; otherwise, target names may be incorrect or non-unique. | pattern = "\\.narrowPeak$" |
plot |
logical | TRUE |
Whether to generate a combined bubble plot summarizing pathway enrichment
across all targets.
If FALSE, only per-target TSV tables are generated.
|
plot = FALSE |
All output files are saved to the specified out_dir:
pathway_annotation_<target>.tsvpathway: Name of the enriched MSigDB Hallmark gene set
(with the “HALLMARK_” prefix removed).hits_region: Number of input peak regions assigned to
genes belonging to this pathway.fold: Fold enrichment of observed region-gene
associations compared to the genomic background expectation.p: Raw p-value from the enrichment test assessing
whether the association is statistically significant.padj: Multiple-testing adjusted p-value (FDR)
correcting for testing across many pathways.hits_gene: Number of unique genes in this pathway that
are linked to at least one input peak.| pathway | hits_region | fold | p | padj | hits_gene |
|---|---|---|---|---|---|
| HEME_METABOLISM | 751 | 2.34242482729004 | 0 | 0 | 158 |
| IL6_JAK_STAT3_SIGNALING | 279 | 2.1539487647391 | 0 | 0 | 49 |
| P53_PATHWAY | 667 | 1.96477997521663 | 0 | 0 | 143 |
| UNFOLDED_PROTEIN_RESPONSE | 314 | 1.88450359297116 | 0 | 0 | 75 |
| TGF_BETA_SIGNALING | 267 | 1.78440015179855 | 0 | 0 | 42 |
| DNA_REPAIR | 278 | 1.71262469023883 | 0 | 0 | 90 |
| ESTROGEN_RESPONSE_EARLY | 828 | 1.69192498116218 | 0 | 0 | 154 |
| ESTROGEN_RESPONSE_LATE | 665 | 1.66254073916161 | 0 | 0 | 151 |
| ADIPOGENESIS | 530 | 1.64575217378315 | 0 | 0 | 145 |
| HYPOXIA | 734 | 1.61320155249418 | 0 | 0 | 140 |
| TNFA_SIGNALING_VIA_NFKB | 792 | 1.58554145852615 | 0 | 0 | 152 |
| XENOBIOTIC_METABOLISM | 450 | 1.56279143172028 | 0 | 0 | 127 |
| UV_RESPONSE_UP | 443 | 1.54357603879619 | 0 | 0 | 107 |
| IL2_STAT5_SIGNALING | 712 | 1.53461386388768 | 0 | 0 | 139 |
| MTORC1_SIGNALING | 505 | 1.49494285184175 | 0 | 0 | 123 |
| APICAL_JUNCTION | 551 | 1.463680066491 | 0 | 0 | 139 |
| APOPTOSIS | 475 | 1.48104653987793 | 3.33066907387547e-16 | 9.79608551139844e-16 | 110 |
| MITOTIC_SPINDLE | 529 | 1.36376400697203 | 4.01345623401994e-12 | 1.11484895389443e-11 | 142 |
| PI3K_AKT_MTOR_SIGNALING | 302 | 1.49192024874113 | 3.21923598889384e-11 | 8.4716736549838e-11 | 70 |
| MYOGENESIS | 508 | 1.31864023785508 | 9.69003322026651e-10 | 2.42250830506663e-09 | 141 |
| MYC_TARGETS_V2 | 119 | 1.82974169254377 | 1.20143017756646e-09 | 2.86054804182492e-09 | 36 |
| REACTIVE_OXYGEN_SPECIES_PATHWAY | 103 | 1.77523750310684 | 6.06994279284123e-08 | 1.37953245291846e-07 | 34 |
| G2M_CHECKPOINT | 513 | 1.26492214857928 | 1.23927521800127e-07 | 2.69407656087232e-07 | 136 |
| ANDROGEN_RESPONSE | 330 | 1.33614385753243 | 2.39605753793448e-07 | 4.9917865373635e-07 | 75 |
| INTERFERON_GAMMA_RESPONSE | 444 | 1.26193664345416 | 1.04498334296821e-06 | 2.08996668593642e-06 | 109 |
| MYC_TARGETS_V1 | 320 | 1.29514823159424 | 4.42162588376593e-06 | 8.50312669954987e-06 | 117 |
| OXIDATIVE_PHOSPHORYLATION | 340 | 1.27782170145938 | 6.7021178460358e-06 | 1.24113293445107e-05 | 132 |
| PEROXISOME | 227 | 1.34919819212654 | 8.78569769868776e-06 | 1.56887458905138e-05 | 67 |
| GLYCOLYSIS | 493 | 1.21410566576122 | 1.33914178344074e-05 | 2.30886514386335e-05 | 139 |
| COAGULATION | 274 | 1.2900533590942 | 2.61656722224668e-05 | 4.36094537041113e-05 | 81 |
| E2F_TARGETS | 340 | 1.22929930562074 | 0.000113631696623262 | 0.000183276930037519 | 123 |
| PROTEIN_SECRETION | 243 | 1.21740811782978 | 0.00152880417435064 | 0.00238875652242287 | 58 |
| COMPLEMENT | 438 | 1.13378486167835 | 0.00491815778703741 | 0.00745175422278396 | 126 |
| INFLAMMATORY_RESPONSE | 499 | 1.11258543831093 | 0.00935157469816006 | 0.0137523157325883 | 123 |
| INTERFERON_ALPHA_RESPONSE | 158 | 1.18824427654253 | 0.0184387753694073 | 0.0263411076705819 | 49 |
| APICAL_SURFACE | 142 | 1.19709034950041 | 0.0197614340616903 | 0.0274464361967921 | 31 |
| NOTCH_SIGNALING | 99 | 1.23636420287506 | 0.0222803736236655 | 0.0301086130049534 | 20 |
| ALLOGRAFT_REJECTION | 359 | 1.09140387222255 | 0.0517507656437572 | 0.0680931126891543 | 101 |
| KRAS_SIGNALING_DN | 455 | 1.07232300390841 | 0.0706418972348002 | 0.0905665349164105 | 117 |
| CHOLESTEROL_HOMEOSTASIS | 155 | 1.12700521636616 | 0.0753616524855787 | 0.0942020656069734 | 47 |
| WNT_BETA_CATENIN_SIGNALING | 140 | 1.1286634877976 | 0.0839096466534721 | 0.102328837382283 | 30 |
| FATTY_ACID_METABOLISM | 279 | 1.08475876599981 | 0.0920066254830114 | 0.109531697003585 | 91 |
| ANGIOGENESIS | 104 | 1.12332164952526 | 0.128663088630856 | 0.149608242594018 | 18 |
| HEDGEHOG_SIGNALING | 145 | 0.969022723736511 | 0.659136328916236 | 0.749018555586632 | 30 |
| SPERMATOGENESIS | 258 | 0.90177989643402 | 0.957419784368706 | 1 | 79 |
| UV_RESPONSE_DN | 554 | 0.924360737401476 | 0.972326175648468 | 1 | 95 |
| BILE_ACID_METABOLISM | 188 | 0.853630810616856 | 0.988205830454095 | 1 | 61 |
| KRAS_SIGNALING_UP | 508 | 0.870884862971025 | 0.999415085430068 | 1 | 119 |
| PANCREAS_BETA_CELLS | 81 | 0.685957113608895 | 0.999876069754364 | 1 | 26 |
| EPITHELIAL_MESENCHYMAL_TRANSITION | 442 | 0.697646781527174 | 1 | 1 | 112 |
Note: This table shows the pathway enrichment annotation results for peaks from H3K4me1-H3K4me1 in the C1 sample.
pathway_annotation.pdf (when plot = TRUE)-log10(padj) for visual
stabilitylog2(1 + fold_enrichment)
library(multiEpiCore)
# ===== Test Data =======
pk_dirs <- list.dirs("peak", full.names = TRUE, recursive = FALSE)
for (pk_dir in pk_dirs) {
sample <- basename(pk_dir)
sample_out <- file.path(pk_dir, "pathway_annotation")
peak_paths <- list.files(path = pk_dir, pattern = "\\.narrowPeak$", recursive = FALSE, full.names = TRUE)
peak_pathway_annotation(
peak_path = peak_paths,
out_dir = sample_out
)
}
The peak_genomic_distribution() function performs
genomic feature annotation for peak BED files and quantifies their
distribution across regulatory element categories. Each peak set is
converted into genomic ranges and annotated against selected reference
resources such as gene features, cCRE elements, ChromHMM chromatin
states, and repeat elements, followed by optional visualization of
distribution patterns.
| Parameter | Type | Default | Description | Example |
|---|---|---|---|---|
bed_path |
character | — |
Path(s) to input BED files, or a directory containing BED files.
Each BED file is treated as one target, and target names are extracted
from filenames using pattern.
|
c("CTCF_peaks.narrowPeak", "H3K27ac_peaks.narrowPeak") |
out_dir |
character | "./" |
Output directory for annotation results. Results for each annotation type and target are saved here. | "./distribution_results" |
pattern |
character | "_peaks\\.narrowPeak$" |
Regular expression used to remove suffix text from each BED filename when constructing target names. This must match the naming convention of your peak files; otherwise, target names may be incorrect or non-unique. | pattern = "\\.narrowPeak$" |
distributions |
character vector | c("genic", "ccre") |
Annotation types to compute.
Supported options are "genic", "ccre",
"chromhmm", and "repeat".
At least one annotation type must be specified.
|
c("genic", "ccre", "repeat") |
ref_genome |
character | "hg38" |
Reference genome version used for annotation.
Supported options are "hg38" and "mm10".
|
"mm10" |
ref_source |
character | "knownGene" |
Gene annotation source for genic-based annotation.
Supported options are "knownGene" (UCSC)
and "GENCODE".
|
"GENCODE" |
mode |
character | "nearest" |
Method used to assign peaks to genomic features.
"nearest" assigns each peak to the closest feature,
while "weighted" assigns annotations proportionally
based on overlap length.
|
"weighted" |
plot |
logical | TRUE |
Whether to generate visualization plots summarizing genomic distribution across annotation categories. | FALSE |
"genic": Gene structural features - Promoter, 5’ UTR,
Exon, Intron, 3’ UTR"ccre": Candidate cis-Regulatory Elements - dELS, pELS,
PLS, CA-H3K4me3, CA-CTCF, CA-TF, TF, CA"chromhmm": Chromatin states - Acet, EnhWk, EnhA,
PromF, TSS, TxWk, TxEx, Tx, OpenC, TxEnh, ReprPCopenC, BivProm, ZNF,
ReprPC, HET, GapArtf, Quies"repeat": Repetitive elements - SINE, LINE, LTR,
Retroposon, RC, DNA, Satellite, Simple_repeat, Low_complexity, rRNA,
tRNA, snRNA, scRNA, srpRNA, RNA, Unknown"nearest": Each genomic region is assigned to its
closest feature (by distance to feature center)"weighted": Each region is proportionally assigned to
overlapping features based on overlap lengthnearest mode is faster and simpler;
weighted mode provides more accurate representation for
regions spanning multiple featuresThe function generates the following output files in the specified
out_dir: 1. Composition tables -
{genic/ccre/chromhmm/repeat}_distribution.tsv - .tsv tables
summarizing the percentage composition of various annotations - Row:
cluster lables - Col: annotations states - Values represent the
proportion (%) of genomic regions assigned to each state within a
cluster
| Promoter | 5’ UTR | Exon | Intron | 3’ UTR | other | |
|---|---|---|---|---|---|---|
| H3K27me3-H3K4me1 | 13.55 | 1.39 | 7.16 | 46.07 | 2.37 | 29.46 |
| H3K27me3-H3K4me3 | 25.88 | 3.98 | 9.22 | 38.76 | 2.14 | 20.02 |
| H3K27me3-H3K9me2 | 8.96 | 0.96 | 5.57 | 49.88 | 2.05 | 32.58 |
| H3K27me3-H3K9me3 | 10.61 | 1.04 | 6.17 | 47.40 | 2.06 | 32.72 |
| …… | ||||||
| H3K9me3-H3K9me3 | 6.79 | 0.70 | 4.30 | 45.85 | 1.41 | 40.95 |
| dELS | pELS | PLS | CA-H3K4me3 | |
|---|---|---|---|---|
| H3K27me3-H3K4me1 | 50.47 | 13.57 | 8.13 | 2.40 |
| H3K27me3-H3K4me3 | 35.62 | 21.39 | 21.62 | 2.01 |
| H3K27me3-H3K9me2 | 39.89 | 9.36 | 4.06 | 2.66 |
| H3K27me3-H3K9me3 | 40.86 | 9.89 | 5.74 | 3.17 |
| …… | ||||
| H3K9me3-H3K9me3 | 26.82 | 5.78 | 3.51 | 4.54 |
| CA-CTCF | CA-TF | TF | CA | other | |
|---|---|---|---|---|---|
| H3K27me3-H3K4me1 | 2.43 | 0.86 | 2.18 | 4.41 | 15.55 |
| H3K27me3-H3K4me3 | 1.50 | 0.47 | 1.62 | 2.99 | 12.79 |
| H3K27me3-H3K9me2 | 2.79 | 0.78 | 2.58 | 6.06 | 31.82 |
| H3K27me3-H3K9me3 | 2.82 | 0.82 | 2.64 | 5.86 | 28.21 |
| …… | |||||
| H3K9me3-H3K9me3 | 2.98 | 0.77 | 3.02 | 6.78 | 45.79 |
Note: These tables show the genomic distribution results for peaks from H3K4me1-H3K4me1 in the C1 sample.
{genic/ccre/chromhmm/repeat}_distribution.pdf


Note: These figurs show the genomic distribution results for peaks from H3K4me1-H3K4me1 in the C1 sample.
library(multiEpiCore)
# Test Data
pk_dirs <- list.dirs("peak", full.names = TRUE, recursive = FALSE)
for (pk_dir in pk_dirs) {
sample <- basename(pk_dir)
sample_out <- file.path(pk_dir, "genomic_distribution")
peak_paths <- list.files(path = pk_dir, pattern = "\\.narrowPeak$", recursive = FALSE, full.names = TRUE)
peak_genomic_distribution(
peak_path = peak_paths,
out_dir = sample_out
)
}
Unlike other steps in this package where each function operates across all CRF pairs simultaneously, the differential analysis here is structured per CRF pair across samples. This reflects the biological reality that each CRF pair targets a distinct genomic locus set — peak regions from different CRF pairs are not directly comparable and should never be pooled into a shared count matrix.
peak_differential_analysis() is the recommended entry
point for this step. It wraps build_peak_set() and
differential_regions_single_peak() into a single call: for
each CRF pair found consistently across all sample directories, it first
constructs a condition-aware consensus peak set, then counts fragment
overlap per sample and runs the voom-limma differential analysis.
| Parameter | Type | Default | Description | Example |
|---|---|---|---|---|
peak_dirs |
character vector | — | One directory per sample containing narrowPeak files. Must be in the same order as conditions. Each directory is scanned for files matching peak_pattern; the pair name is extracted by stripping the pattern from the filename. |
peak_dirs = c("./peak/C1", "./peak/C2", "./peak/T1", "./peak/T2") |
bam_dirs |
character vector | — | One directory per sample containing BAM files, in the same order as peak_dirs and conditions. Pair names are extracted by stripping bam_pattern from the filename; only pairs present across all BAM and peak directories are processed. |
bam_dirs = c("./bam/C1", "./bam/C2", "./bam/T1", "./bam/T2") |
conditions |
character vector | — | Condition label for each sample, in the same order as peak_dirs and bam_dirs. Each condition must have at least 2 replicates. |
conditions = c("C", "C", "T", "T") |
sample_names |
character vector or NULL | NULL |
Optional display names for each sample, in the same order as conditions. If NULL, names are auto-generated as {condition}{replicate_index}. |
sample_names = c("C1", "C2", "T1", "T2") |
ref_genome |
character | "hg38" |
Reference genome used to obtain chromosome sizes for boundary checking during tiling. Must be one of "hg38" or "mm10". Ignored when window_size is NULL. |
ref_genome = "mm10" |
out_dir |
character | "./" |
Root output directory. Consensus BED files are written to {out_dir}/peak_sets/ and differential results to {out_dir}/differential/. |
out_dir = "./results" |
bam_pattern |
character | "\\.bam$" |
Regular expression used to remove suffix text from each BAM filename when constructing target names.
This must match the naming convention of your BAM files and produce the same target names as peak_pattern;
otherwise, pair matching across peak and BAM directories will fail.
|
bam_pattern = "\\.bam$" |
peak_pattern |
character | "_peaks\\.narrowPeak$" |
Regex pattern used to scan peak_dirs and to strip the suffix when extracting pair names from filenames. |
peak_pattern = "_peaks\\.narrowPeak$" |
window_size |
integer or NULL | NULL |
Passed to build_peak_set(). If NULL, master regions are used directly. If a positive integer, regions are tiled into fixed-size windows. |
window_size = 500L |
min_support |
integer | 2 |
Passed to both build_peak_set() and differential_regions_single_peak(). Minimum number of samples within at least one condition group required to retain a region at the peak-building stage, and to pass the non-zero filter at the testing stage. |
min_support = 2 |
lfc_threshold |
numeric | 0.5 |
Minimum absolute log2 fold-change for significance. | lfc_threshold = 0.5 |
fdr_threshold |
numeric | 0.05 |
Adjusted p-value (BH) threshold for significance. | fdr_threshold = 0.05 |
{out_dir}/
├── peak_sets/
│ ├── {pair}.bed # consensus peak set per CRF pair
│ └── ...
└── differential/
├── {test}_vs_{ref}_{pair}_all.tsv
├── {test}_vs_{ref}_{pair}_sig.tsv
├── {test}_vs_{ref}_summary.tsv
├── {test}_vs_{ref}_summary.pdf
└── ...
peak_differential_analysis(
peak_dirs = c("./peak/C1", "./peak/C2", "./peak/T1", "./peak/T2"),
bam_dirs = c("./bam/C1", "./bam/C2", "./bam/T1", "./bam/T2"),
conditions = c("C", "C", "T", "T"),
sample_names = c("C1", "C2", "T1", "T2"),
out_dir = "./peak",
min_support = 2,
window_size = NULL,
lfc_threshold = 0.5,
fdr_threshold = 0.05
)
Given a set of narrowPeak files from the same target(CRF
pair) across multiple samples, build_peak_set()
function constructs a unified peak set that represents regions
consistently enriched across the cohort.
What this function does:
First, all per-sample peak intervals are pooled together. The genome
is then atomized into the smallest non-overlapping sub-intervals defined
by all peak boundaries, and each sub-interval is scored by how many
distinct samples contain it. Sub-intervals supported by fewer than
min_support samples are discarded as sample-specific noise;
the remaining sub-intervals are merged into contiguous consensus blocks
wherever they are adjacent.
window_size is NULL, these master
regions are exported directly as the final peak set (each region retains
its natural variable width).window_size is specified, each master region is
divided into fixed-size windows tiled outward from the region’s center.
The number of windows is ceil(region_width / window_size). For an odd
number of windows, the center window is anchored at the region midpoint;
for an even number, the midpoint falls at the boundary between the two
central windows. Windows that extend beyond chromosome boundaries are
discarded.| Parameter | Type | Default | Description | Example |
|---|---|---|---|---|
peak_path
|
character vector | — |
Vector of narrowPeak or BED file paths, one per sample. Each file must
contain at least three columns: chrom,
chromStart, chromEnd.
|
peak_path = c(“S1_peaks.narrowPeak”, “S2_peaks.narrowPeak”)
|
pair
|
character | — |
Output file basename (without extension). The final BED file is written
as <pair>.bed inside out_dir.
|
pair = “H3K27ac_consensus”
|
conditions
|
character vector | — |
Condition label for each sample, in the same order as
peak_path. Used together with min_support to
apply a per-group reproducibility filter.
|
conditions = c(“ctrl”, “ctrl”, “treat”, “treat”)
|
ref_genome
|
character |
“hg38”
|
Reference genome used to obtain chromosome sizes for boundary checking
during tiling. Must be one of “hg38” or
“mm10”. Ignored when window_size is
NULL.
|
ref_genome = “mm10”
|
out_dir
|
character |
“./”
|
Output directory. Created recursively if it does not exist. |
out_dir = “./peak_sets”
|
window_size
|
integer or NULL |
NULL
|
If NULL, master regions are exported directly. If a
positive integer, each master region is tiled into fixed-size windows of
this width. Regions narrower than window_size are skipped.
|
window_size = 500L
|
min_support
|
integer |
2
|
Minimum number of distinct samples within a single condition group required to retain an atomic sub-interval. A sub-interval is kept if at least one condition group meets this threshold, so condition-specific but reproducible peaks are not discarded. |
min_support = 3
|
{pair}.bedchrom, chromStart,
chromEndlibrary(multiEpiCore)
library(stringr)
# =============== Test data ===============
peak_dirs <- c("./peak/C1", "./peak/C2", "./peak/T1", "./peak/T2")
conditions <- c("C", "C", "T", "T")
pattern <- "_peaks\\.narrowPeak$"
out_dir <- "./peak/peak_sets"
# collect peak files and pair names per directory
dir_peaks <- lapply(peak_dirs, function(d) {
paths <- list.files(d, pattern = pattern, full.names = TRUE)
pairs <- basename(paths) |> str_replace(pattern, "")
setNames(paths, pairs)
})
# retain only pair names present in ALL directories
common_pairs <- Reduce(intersect, lapply(dir_peaks, names))
message(length(common_pairs), " pairs found across all samples.")
# build peak set for each common pair
peak_set_paths <- lapply(common_pairs, function(pair) {
peak_path <- sapply(dir_peaks, function(x) x[[pair]])
build_peak_set(
peak_path = peak_path,
conditions = conditions,
pair = pair,
out_dir = out_dir,
window_size = NULL,
min_support = 2
)
})
names(peak_set_paths) <- common_pairs
To restrict the peak set to specific genomic contexts, the output BED
file can be further filtered or trimmed using
bedtools intersect:
# Filter peak_set to regions overlapping a set of interest
# retaining the original peak_set intervals
bedtools intersect \
-a peak_set.bed \
-b regions_of_interest.bed \
-u \
> peak_set.overlap_only.bed
# Restrict peak_set to only the bases shared with regions_of_interest,
# collapsing all overlapping region entries into a single merged interval per peak (-u).
bedtools intersect \
-a peak_set.bed \
-b regions_of_interest.bed \
-u \
| bedtools merge \
> peak_set.overlap_trimmed.bed
# Trim each peak_set interval to the exact overlapping segments with regions_of_interest
# splitting a single peak into multiple shorter intervals if it overlaps multiple regions.
bedtools intersect \
-a peak_set.bed \
-b regions_of_interest.bed \
> peak_set.overlap_segments.bed
Given a consensus peak set (e.g. produced by
build_peak_set()) and a set of BAM files from the
same target (CRF pair) across multiple samples,
differential_regions_single_peak() counts fragment overlap
per region per sample, then identifies differentially accessible regions
(DARs) between conditions using a two-round voom-limma framework.
What this function does:
For each BAM file, paired-end fragments are reconstructed from read
pairs and their overlap with each consensus region is quantified as a
proportional count. The resulting region × sample count matrix is passed
to a two-round voom + limma pipeline: a first voom pass is used to
filter low-expression regions by mean expression quantile, and a second
voom pass fits the final linear model. Empirical Bayes shrinkage
(eBayes) is applied and all pairwise condition comparisons
are tested. Results are written to disk as full result tables,
significant DAR tables, and per-comparison summary files.
| Parameter | Type | Default | Description | Example |
|---|---|---|---|---|
bam_path
|
character vector | — |
Vector of BAM file paths, one per sample, all from the same CRF
pair. Must be in the same order as conditions and
sample_names. Each BAM must be coordinate-sorted and
indexed.
|
bam_path = c(“C1/CRF1.bam”, “C2/CRF1.bam”, “T1/CRF1.bam”,
“T2/CRF1.bam”)
|
conditions
|
character vector | — |
Condition label for each sample, in the same order as
bam_path. Each condition must have at least 2 replicates.
|
conditions = c(“C”, “C”, “T”, “T”)
|
regions
|
character | — |
Path to a BED file defining the consensus regions to test. Typically the
output of build_peak_set(). Must be a headerless,
tab-separated file with columns chrom,
chromStart (0-based), chromEnd.
|
regions = “./peak_sets/CRF1.bed”
|
pair
|
character | — | Name of the CRF pair being analyzed. Used as a prefix in all output filenames. |
pair = “CRF1”
|
sample_names
|
character vector or NULL |
NULL
|
Optional display names for each sample, in the same order as
bam_path. If NULL, names are auto-generated as
{condition}{replicate_index}.
|
sample_names = c(“C1”, “C2”, “T1”, “T2”)
|
out_dir
|
character |
“./”
|
Output directory. Created recursively if it does not exist. |
out_dir = “./differential/CRF1”
|
min_support
|
integer |
2
|
Minimum number of samples with non-zero counts within at least one condition group for a region to be tested. |
min_support = 2
|
mean_quantile
|
numeric |
0
|
Regions with mean log2 expression below this quantile (from the first voom pass) are removed before the second voom fit. |
mean_quantile = 0.1
|
lfc_threshold
|
numeric |
0.5
|
Minimum absolute log2 fold-change for a region to be called significant. |
lfc_threshold = 0.5
|
fdr_threshold
|
numeric |
0.05
|
Adjusted p-value (BH) threshold for significance. |
fdr_threshold = 0.05
|
All files are written to out_dir, organized by
comparison tag (in the form {test}_vs_{ref}) read from the
conditions, with pair or cluster name as a subdirectory prefix.
<comparison>/<pair>_all.tsv
pos,
logFC, AveExpr, t,
P.Value, adj.P.Val, B| pos | logFC | AveExpr | t | P.Value | adj.P.Val | B |
|---|---|---|---|---|---|---|
| <chr> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> |
| chr1_161083373_161084202 | 3.9289 | 6.4669 | 10.5248 | 9.49e-12 | 1.41e-07 | 16.7132 |
| chr1_205229920_205230422 | -4.0230 | 5.6295 | -9.2274 | 2.14e-10 | 1.59e-06 | 13.6221 |
| chr19_51417328_51417981 | 3.3961 | 5.8269 | 8.7129 | 7.83e-10 | 3.87e-06 | 12.4792 |
| chr22_19447971_19448534 | -3.8683 | 5.5171 | -8.4705 | 1.46e-09 | 5.41e-06 | 11.7937 |
| chr20_51360094_51360812 | 3.2724 | 5.7498 | 8.0526 | 4.35e-09 | 1.11e-05 | 10.8309 |
| … | ||||||
Note: This table shows differential analysis results for H3K4me1-H3K4me1 in the test dataset.
<comparison>/<pair>_sig.tsv
fdr_threshold and
lfc_threshold<comparison>/summary.tsv
| group | n_rows_before | n_rows_after_nonzero | n_rows_after_mean | n_sig | n_up | n_down |
|---|---|---|---|---|---|---|
| <chr> | <int> | <int> | <int> | <int> | <int> | <int> |
| H3K9me2-H3K9me3 | 624 | 613 | 613 | 89 | 39 | 50 |
| H3K4me1-H3K9me3 | 7406 | 7368 | 7368 | 0 | 0 | 0 |
| H3K9me3-H3K9me3 | 17495 | 17417 | 17417 | 1408 | 614 | 794 |
| H3K27me3-H3K9me3 | 18828 | 18751 | 18751 | 1358 | 653 | 705 |
| H3K4me3-H3K4me3 | 15631 | 15630 | 15630 | 668 | 443 | 225 |
| H3K4me1-H3K4me1 | 14844 | 14836 | 14836 | 2235 | 1166 | 1069 |
| H3K27me3-H3K4me1 | 39047 | 39046 | 39046 | 2179 | 1058 | 1121 |
| H3K4me3-H3K9me3 | 12988 | 12982 | 12982 | 31 | 17 | 14 |
| H3K4me1-H3K4me3 | 15595 | 15594 | 15594 | 250 | 165 | 85 |
| H3K27me3-H3K4me3 | 17133 | 17124 | 17124 | 59 | 46 | 13 |
| H3K27me3-H3K9me2 | 728 | 710 | 710 | 140 | 32 | 108 |
| … | ||||||
Note: Rows are appended across calls sharing the same
out_dir, so results from multiple pairs within the same
comparison are accumulated into a single file.
library(multiEpiCore)
library(stringr)
# =============== Test data ===============
bam_dirs <- c("./bam/C1", "./bam/C2", "./bam/T1", "./bam/T2")
conditions <- c("C", "C", "T", "T")
pattern <- "\\.bam$"
peak_set_dir <- "./peak/peak_sets" # output of build_peak_set()
out_dir <- "./peak/differential"
# collect BAM files and pair names per directory
dir_bams <- lapply(bam_dirs, function(d) {
paths <- list.files(d, pattern = pattern, full.names = TRUE)
setNames(paths, basename(paths) |> str_replace(pattern, ""))
})
# retain only pairs present in ALL directories
common_pairs <- Reduce(intersect, lapply(dir_bams, names))
message(length(common_pairs), " pairs found across all samples.")
if (length(common_pairs) == 0)
stop("No common pairs found. Check BAM filenames and directories.")
# run differential analysis for each common pair
results <- lapply(common_pairs, function(pair) {
bam_path <- sapply(dir_bams, function(x) x[[pair]])
differential_regions_single_peak(
bam_path = bam_path,
conditions = conditions,
regions = file.path(peak_set_dir, paste0(pair, ".bed")),
pair = pair,
out_dir = out_dir,
min_support = 2,
mean_quantile = 0,
lfc_threshold = 0.5,
fdr_threshold = 0.05
)
})
names(results) <- common_pairs