1. cCRE & ChromHMM Annotation

2. TFBS Annotation

2A. Generate matched control regions

The get_matched_control() function generates background control regions by randomly selecting genes with similar length to the nearest gene of each target peak, then placing control regions at the same TSS-relative distance on the selected genes. This approach preserves the relationship between peaks and gene structure while avoiding biases from using the same genomic neighborhoods.

What this function does:

  • Identifies the nearest gene for each target genomic region

  • Finds candidate genes with similar length (within tolerance threshold)

  • Randomly selects control genes from candidates, excluding the original gene

  • Places control regions at identical TSS-relative distances on control genes

  • Validates control regions are within chromosome boundaries

  • Generates multiple control regions per target (configurable replicates)

  • Automatically relaxes length tolerance if insufficient candidates found

  • Returns GRanges object with all generated control regions

Parameters

Parameter Type Required Description Example
target_gr GRanges Yes GRanges object containing target peak regions for which controls will be generated target_gr = my_peaks
ref_genome character No (default: “hg38”) Reference genome version: “hg38” (Human GRCh38) or “mm10” (Mouse GRCm38) ref_genome = "hg38"
style character No (default: “UCSC”) Chromosome naming style (e.g., “UCSC” uses “chr1”, “Ensembl” uses “1”) style = "UCSC"
n_rep integer No (default: 1) Number of control regions to generate per target peak n_rep = 3
regions integer No (default: 800) Width of control regions in base pairs (centered on calculated position) regions = 500
seed integer No (default: 42) Random seed for reproducible control region selection seed = 12345
length_tolerance numeric No (default: 0.2) Initial tolerance for gene length matching (0.2 = ±20%). Automatically relaxed up to ±100% if insufficient candidates length_tolerance = 0.3
  • Chromosome handling:
    • Only uses standard chromosomes (excludes mitochondrial DNA)
    • For hg38: chr1-chr22, chrX, chrY
    • For mm10: chr1-chr19, chrX, chrY
    • Control regions must fit within chromosome boundaries
    • Invalid positions (out of bounds) are skipped
  • TSS-relative positioning:
    • Preserves signed distance from target peak to nearest gene TSS
    • Negative distance: peak is upstream of TSS
    • Positive distance: peak is downstream of TSS
    • Accounts for gene strand orientation (+ or -)
    • Control center = Control gene TSS + original TSS distance
  • Gene length matching:
    • Initial tolerance: ±20% of original gene length (default)
    • If insufficient candidates (< n_rep * 2), tolerance increases by 10% increments
    • Maximum automatic tolerance: ±100% (doubled or halved gene length)
    • If still insufficient, uses all genes except the original
    • Ensures control genes have comparable structural features

Output

The function returns a GRanges object containing all successfully generated control regions.

  • Object type: GRanges (GenomicRanges package)

  • Number of regions: Up to n_rep × length(target_gr) regions

    • Actual count may be lower if some controls fail boundary validation
    • Each target peak ideally generates n_rep control regions
  • Region characteristics:

    • Width: All regions have uniform width specified by regions parameter
    • Positioning: Centered on calculated TSS-relative positions
    • Chromosome style: Follows the specified style parameter (e.g., “chr1” for UCSC)
    • Genome info: Includes seqlengths from reference genome

Usage Examples

library(epigenomeR)
library(GenomicRanges)

# Load target peaks
target_peaks <- rtracklayer::import("CTCF_peaks.bed")

# Basic usage - 1 control per target
control_regions <- get_matched_control(
  target_gr = target_peaks,
  ref_genome = "hg38"
)

# Generate multiple controls per target
control_regions <- get_matched_control(
  target_gr = target_peaks,
  ref_genome = "hg38",
  n_rep = 5,
  regions = 500
)

# Mouse genome with relaxed length tolerance
control_regions <- get_matched_control(
  target_gr = target_peaks,
  ref_genome = "mm10",
  n_rep = 3,
  length_tolerance = 0.3,
  seed = 123
)

2B. TFBS Enrichment Analysis

The TFBS_enrichment() function performs motif enrichment analysis comparing a set of target genomic regions against control regions using Fisher’s exact test to identify significantly enriched transcription factor binding motifs.

What this function does:

  • Compares target genomic regions (GRanges) against background/control regions (GRanges)

  • Works directly with GRanges objects without file I/O for region inputs

  • Overlaps regions with a comprehensive motif library (stored as RDS files)

  • Optionally filters motif sites using functional region annotations (GRanges)

  • Counts motif occurrences in target vs control regions using countOverlaps()

  • Performs Fisher’s exact test for each motif with pseudocount correction

  • Calculates odds ratios with standard errors

  • Adjusts p-values for multiple testing using Benjamini-Hochberg FDR

  • Identifies transcription factors with significantly enriched binding sites

  • Provides quantitative enrichment statistics for downstream interpretation

Parameters

Parameter Type Required Description Example
target_region GRanges Yes GRanges object containing genomic regions of interest target_region = my_peaks
control_region GRanges Yes GRanges object containing background/control regions for comparison control_region = background_peaks
functional_region GRanges or NULL No (default: NULL) Optional GRanges object with functional regions for filtering motif sites. Only motifs overlapping these regions are considered functional_region = open_chromatin
out_path character No (default: “./TFBS_enrichment.tsv”) File path to save enrichment results table (TSV format) out_path = "./results/motif_enrichment.tsv"
regions integer or NULL No (default: NULL) Size in base pairs to resize all regions around their center point. If NULL, regions are used as-is regions = 500
ref_genome character No (default: “hg38”) Reference genome version: “hg38” or “mm10” ref_genome = "hg38"
style character or NULL No (default: NULL) Chromosome naming style (e.g., “UCSC”, “Ensembl”). If NULL, auto-detects from target_region style = "UCSC"
  • Input format:
    • Function expects GRanges objects directly (not file paths)
    • All chromosome naming styles are harmonized automatically
    • If style is NULL, uses the style from target_region
  • Region resizing:
    • Optional: Controlled by the regions parameter
    • If regions = NULL (default): Regions are used with their original widths
    • If regions is specified (e.g., 500): Both target and control regions are resized to the specified width, centered on each region’s midpoint.
  • Statistical method:
    • Adds pseudocount (+1) to all counts to avoid division by zero
    • Uses one-sided Fisher’s exact test (alternative = “greater”)
    • Tests enrichment hypothesis (motif more frequent in target)

Output Files

The function generates the following output file at the specified output_path:

  1. Motif enrichment table - <output_path> (TSV format)
    • Tab-separated table with motif enrichment statistics
    • Rows: Individual transcription factor motifs from JASPAR database
    • Columns:
      • Row names: Motif identifiers (e.g., “MA0139.1” for CTCF)
      • odds_ratio: Odds ratio from Fisher’s exact test (enrichment magnitude)
        • Value > 1: Motif enriched in target regions
        • Value < 1: Motif depleted in target regions
        • Higher values indicate stronger enrichment
      • pvalue: Raw p-value from Fisher’s exact test (one-sided, greater)
        • Tests if motif is significantly more frequent in target vs control
        • Lower values indicate stronger statistical significance
      • odds_ratio_se: Standard error of log odds ratio
        • Measure of uncertainty in odds ratio estimate
        • Useful for calculating confidence intervals
      • FDR: False Discovery Rate (Benjamini-Hochberg adjusted p-value)
        • Corrects for multiple testing across all motifs
        • Typical threshold: FDR < 0.05 or 0.01
      • target_hit: Number of target regions overlapping motif sites
        • Count of target regions containing the motif
      • control_hit: Number of control regions overlapping motif sites
        • Count of control regions containing the motif
    • Sorted by row names (motif IDs)
    • Can be sorted by FDR or odds_ratio for prioritization
    feature odds_ratio pvalue odds_ratio_se FDR target_hit control_hit
    (CRISPR targeting AFF4)AFF4 16.14696 2.348799e-123 0.1066921 1.327293e-122 194 768
    (CRISPR targeting ARID3B)ARID3B 35.14312 1.320588e-76 0.1620585 3.858694e-76 78 95
    (CRISPR targeting ARID4B)ARID4B 34.39295 7.107417e-223 0.1229221 4.257343e-221 308 1241
    (CRISPR targeting ATF1)ATF1 18.37715 3.968721e-74 0.1327106 1.132030e-73 98 242
    (CRISPR targeting ATF2)ATF2 18.67509 6.632538e-33 0.1978719 1.119124e-32 40 83
    (CRISPR targeting ATF3)ATF3 23.37675 9.848921e-136 0.1117167 7.023218e-135 174 444

Example Usage

library(epigenomeR)
target_region <- rtracklayer::import("target_peaks.bed")
control_region <- rtracklayer::import("control_peaks.bed")

# Basic enrichment analysis (using original region widths)
TFBS_enrichment(
  target_region = target_region,
  control_region = control_region,
  out_path = "./TFBS_enrichment.tsv",
  ref_genome = "hg38"
)

# With uniform region resizing (200bp centered windows)
TFBS_enrichment(
  target_region = target_region,
  control_region = control_region,
  regions = 200,
  out_path = "./TFBS_enrichment_200bp.tsv",
  ref_genome = "hg38"
)

# With functional region filtering (e.g., open chromatin)
functional_region <- rtracklayer::import("ATAC_peaks.bed")
TFBS_enrichment(
  target_region = target_region,
  control_region = control_region,
  regions = 500,
  functional_region = functional_region,
  out_path = "./TFBS_enrichment_filtered.tsv",
  ref_genome = "hg38"
)

2C. TFBS Enrichment Heatmap Visualization

The TFBS_enrichment_heatmap() function generates heatmaps to visualize transcription factor binding site (TFBS) enrichment patterns across multiple genomic clusters or experimental conditions, using results from TFBS_enrichment() analysis.

What this function does:

  • Reads multiple TFBS enrichment result files from TFBS_enrichment() output

  • Combines odds ratios, FDR values, and target hits across all samples/clusters

  • Filters TFBS based on significance (FDR ≤ 0.05), effect size (odds ratio ≥ 2), and coverage criteria

  • Allows filtering for specific transcription factors of interest via pattern matching

  • Applies log2 transformation to odds ratios for symmetric visualization

  • Generates a comprehensive heatmap showing all TFBS passing filtering criteria

  • (Optional) Creates a focused heatmap displaying the union of top N most enriched TFBS per sample

  • (Optional) Supports optional hierarchical clustering of samples for pattern discovery

  • Produces publication-ready visualizations with consistent formatting and scalable dimensions

  • Exports log2-transformed odds ratio and FDR matrices for downstream analysis

Parameters

Parameter Type Required Description Example
tsv_path character vector Yes Vector of file paths to TFBS enrichment result TSV files from TFBS_enrichment() analysis tsv_path = c("cluster1.tsv", "cluster2.tsv")
label character vector Yes Vector of sample/cluster labels corresponding to tsv_path. Must have same length as tsv_path label = c("Cluster1", "Cluster2", "Cluster3")
out_dir character Yes Output directory to save heatmaps and data matrices out_dir = "./heatmaps"
top_n integer No (default: NULL) Number of top enriched TFBS per sample to display in second heatmap. If NULL, only the full heatmap is generated top_n = 10
selected_tfs character vector No (default: NULL) Vector of transcription factor names or substrings to filter rows (TFBS). Matching is case-insensitive. If NULL, all TFBS retained selected_tfs = c("CTCF", "AP1", "STAT")
apply_cluster logical No (default: FALSE) Whether to apply hierarchical clustering to columns and reorder them. If TRUE, columns are clustered and reversed for visualization apply_cluster = TRUE
  • Input file requirements:
    • Files must be TSV format with tab-separated values
    • Must be output from TFBS_enrichment() function
    • Required columns: odds_ratio, FDR, target_hit
    • Row names: TFBS/motif IDs (e.g., JASPAR identifiers)
    • All files should have consistent TFBS IDs (mismatches will trigger warnings)
  • Filtering logic:
    • Stringent default criteria (OR ≥ 2, FDR ≤ 0.05)
    • Requires meeting ALL criteria in at least ONE sample
    • Target hit threshold: 10th percentile prevents low-coverage artifacts
    • Only non-NA rows/columns retained
    • Filters applied sequentially: NA removal → biological criteria → selected_tfs
  • Column clustering:
    • Without clustering (apply_cluster = FALSE): Columns ordered by sorted labels
    • With clustering (apply_cluster = TRUE):
      • Hierarchical clustering applied (Euclidean distance, complete linkage)
      • Column order extracted and reversed for better visualization
      • Final heatmap drawn with fixed order (no dendrogram shown)
    • Clustering applied identically to both full and top N heatmaps if both generated

Output Files

The function generates the following output files in the specified out_dir:

  1. Full matrix heatmap - TFBS_heatmap_all.pdf
    • Heatmap showing all motifs passing filtering criteria
    • Rows: Transcription factor motifs (TF IDs)
    • Columns: Clusters or experimental conditions
    • Color scale: Blue (low enrichment) → White (neutral) → Red (high enrichment)
    • Symmetric color scale centered at zero
    • Log2-transformed odds ratios for visualization
    • No row/column clustering by default unless “cluster” transformation specified
  2. Top N motifs heatmap - TFBS_heatmap_top<n>.pdf (if top_n provided)
    • Heatmap showing union of top N most enriched TFBS per sample
    • Selects top N TFBS from each sample independently, then takes union
    • Optional column clustering if apply_cluster = TRUE
  3. Log2 odds ratio matrix - peak_no_perturb.csv
    • CSV file containing log2-transformed odds ratios
    • Rows: TFBS IDs
    • Columns: Cluster/condition names
    • Values: log2(odds_ratio) from enrichment analysis
    • Filtered to significant TFBS only (odds_ratio ≥ 2, FDR ≤ 0.05, target_hit ≥ 10th percentile)
    • Reflects the exact data shown in heatmaps
  4. FDR matrix - peak_no_perturb_fdr.csv
    • CSV file containing FDR-adjusted p-values
    • Rows: TFBS IDs matching odds_ratio_log2.csv
    • Columns: Sample/cluster names (same order as odds_ratio_log2.csv)
    • Values: FDR (False Discovery Rate) from enrichment analysis
    • Useful for identifying significant enrichments

Example Usage

library(epigenomeR)

# Example 1: Basic usage - generate heatmap for all filtered TFBS
TFBS_enrichment_heatmap(
  tsv_path = c(
    "output/TFBS_enrichment_Cluster1.tsv",
    "output/TFBS_enrichment_Cluster2.tsv",
    "output/TFBS_enrichment_Cluster3.tsv"
  ),
  label = c("Cluster1", "Cluster2", "Cluster3"),
  out_dir = "heatmaps/all_tfbs"
)

# Example 2: With top N selection - focus on most enriched TFBS
TFBS_enrichment_heatmap(
  tsv_path = c(
    "output/TFBS_enrichment_Cluster1.tsv",
    "output/TFBS_enrichment_Cluster2.tsv",
    "output/TFBS_enrichment_Cluster3.tsv"
  ),
  label = c("Cluster1", "Cluster2", "Cluster3"),
  out_dir = "heatmaps/top10",
  top_n = 10
)

# Example 3: Filter for specific transcription factors
TFBS_enrichment_heatmap(
  tsv_path = c(
    "output/TFBS_enrichment_TypeA.tsv",
    "output/TFBS_enrichment_TypeB.tsv",
    "output/TFBS_enrichment_TypeC.tsv"
  ),
  label = c("TypeA", "TypeB", "TypeC"),
  out_dir = "heatmaps/selected_tfs",
  top_n = 15,
  selected_tfs = c("CTCF", "AP1", "STAT", "NRF", "FOXO")
)