1. Highly Variable Regions

This workflow provides two complementary approaches for identifying highly variable genomic regions:

  • Integrated workflow (recommended): The detect_hvr() function combines mean-variance modeling and feature selection in a single step, internally calling mav_screen() and filter_hvr() in sequence.

    library(epigenomeR)
    path <- "./count_matrix/Count_Matrix_800_merged_transformed.feather"
    detect_hvr(transformed_cm_path = path, out_dir = "./count_matrix/")
  • Step-by-step workflow: For users who need fine-grained control or want to inspect intermediate results, the individual functions mav_screen() and filter_hvr() can be called separately.

Function Overview:

  • detect_hvr() - Complete pipeline for detecting highly variable regions (wrapper function)

  • mav_screen() - Step 1: Model mean-variance relationship and calculate hypervariance metrics

  • filter_hvr() - Step 2: Select top-ranked regions based on hypervariance values

1A. Model mean-variance relationship and calculate hypervariance metrics

The mav_screen() function models the mean-variance relationship in transformed count data and normalizes features by their expected variance. It computes hypervariance metrics that quantify each feature’s deviation from the expected mean-variance trend.

What this function does:

This function identifies overdispersed genomic features by modeling the mean-variance relationship using GAM or LOESS regression.

The function fits a regression model to characterize the relationship between feature mean and variance in log2 space across all genomic regions. From this fitted model, it calculates the expected variance for each feature based on its mean expression level. Each feature is then normalized by centering on its observed mean and scaling by the expected standard deviation derived from the model fit.

The function computes hypervariance metrics (sum of squared normalized values divided by degrees of freedom) to quantify the degree of overdispersion for each feature. Features with higher hypervariance values exhibit greater variability than expected from the mean-variance trend, indicating biological overdispersion. All statistics including observed mean/variance, expected variance, normalized statistics, and hypervariance metrics are saved to a feather file.

Parameters

Parameter Type Required Description Example
transformed_cm_path character Yes Path to transformed count matrix file (.feather format with “pos” column) transformed_cm_path = "data_transformed.feather"
out_dir character No (default: “./”) Output directory path for results out_dir = "./mav_screen_results"
fitting_model character No (default: “gam”) Model to fit mean-variance relationship: “gam” or “loess” fitting_model = "loess"
k integer No (default: 40) Number of basis functions for GAM spline fitting (only used when fitting_model = “gam”) k = 30
span numeric No (default: 0.5) Smoothing span for LOESS fitting (only used when fitting_model = “loess”) span = 0.3
seed integer No (default: 42) Random seed for reproducibility in sampling seed = 123
font_size numeric No (default: 10) Font size for plot labels and axes font_size = 12
nrow_sample_per numeric No (default: 0.2) Proportion of rows (0-1) to randomly sample for density plot nrow_sample_per = 0.1
plot logical No (default: FALSE) Whether to generate and save diagnostic plots (.png files) plot = TRUE
  • Input data requirements:

    The transformed_cm_path parameter refers directly to a pre-processed count matrix stored as a .feather file. This file must not contain raw counts. Instead, the matrix should already be:

    • Pre-normalized (e.g., quantile normalization, TPM, RPKM, or other appropriate normalization)
    • Log-transformed if necessary for your analysis
    • Batch-corrected if applicable
    • Quality-filtered to remove low-quality features

    Additionally, the .feather file must have the first column named “pos” containing feature identifiers (e.g., chr1_9601_10400).

  • Model selection:

    • Use GAM (default) for smooth, flexible fits with automatic complexity control
    • Use LOESS for local fitting, better for non-linear patterns
  • Memory considerations: For very large datasets, adjust nrow_sample_per to reduce memory usage for density plots.

Output Files

The function generates the following output files in the specified out_dir (default: “./”):

  1. Mean-Variance Statistics File - <input_name>_mav_screen.feather
    • Feather format dataframe with mean-variance statistics for each genomic region
    • Contains the following columns:
      • pos: Original region identifier
      • mean: Mean expression across all samples
      • var: Variance across all samples
      • var_expect: Expected variance from fitted mean-variance model
      • norm_mean: Mean of normalized values (should be ~0)
      • norm_var: Variance of normalized values (should be ~1)
      • hypervar: Hypervariance, sum of squared normalized values divided by (n-1)
      • log2_mean: Log2-transformed mean expression
      • log2_var: Log2-transformed variance
      • log2_var_expected: Log2-transformed expected variance from fitted model
    pos mean var var_expect norm_mean norm_var hypervar
    chr1_9601_10400 0.60482747 3.2718638368 2.999402e+00 2.999402e+00 -2.400220e-17 1.0908388
    chr1_10401_11200 0.06557750 0.1646858526 1.620011e-01 1.620011e-01 -6.149439e-18 1.0165723
    chr1_12801_13600 0.03007577 0.0003917656 6.413126e-04 6.413126e-04 -1.840330e-18 0.6108808
    chr1_14401_15200 0.02905376 0.0001423452 1.494525e-04 1.494525e-04 9.551020e-17 0.9524448
    chr1_15201_16000 0.02994487 0.0007778492 5.396820e-04 5.396820e-04 -4.221334e-17 1.4413104
    chr1_16001_16800 0.02875908 0.0001076822 9.580701e-05 9.580701e-05 -1.361187e-16 1.1239494
    log2_mean log2_var log2_var_expected
    -0.7254044 1.710113 1.585175
    -3.9306552 -2.602211 -2.625862
    -5.0552546 -11.317722 -10.603762
    -5.1051311 -12.778318 -12.731808
    -5.0615474 -10.328222 -10.873420
    -5.1198388 -13.180932 -13.352449
  2. Main diagnostic plots - <input_name>_mean_variance.png (if plot = TRUE)
    • Combined two-panel figure showing:
      • Panel A: Log2(mean) vs Log2(variance) with fitted trend line in red
      • Panel B: Log2(mean) vs Hypervariance
    • Points shown with transparency to visualize density
    • Theme: black and white with customizable font size
    • Saved as PNG format
  3. Density plot - <input_name>_fit_density.png (if plot = TRUE)
    • Point density visualization of mean-variance relationship
    • Uses viridis color scale to show point density
    • Red points show expected variance from fitted model
    • Based on randomly sampled subset of data (controlled by nrow_sample_per and seed)
    • Useful for visualizing patterns in large datasets
    • Saved as PNG format

Example Usage

library(epigenomeR)
path <- "./count_matrix/Count_Matrix_800_merged_transformed.feather"
mav_screen(transformed_cm_path = path, out_dir = "./count_matrix/", plot=TRUE)

1B. Select top-ranked regions based on hypervariance values

The filter_hvr() function identifies the most informative genomic regions by combining hypervariance-based selection with stratified sampling across expression levels.

What this function does:

  • Stratified feature selection: Divides the log2(mean) expression range into equal-sized bins to ensure balanced representation across low, medium, and high expression levels

  • Hypervariance filtering: Selects top hypervariant regions from each bin, prioritizing regions with biological variability (hypervariance > 1 indicates overdispersion)

  • Expression threshold: Applies a percentile-based cutoff to retain only highly expressed regions, removing lowly expressed features that may be noisy

  • Count matrix extraction: Retrieves corresponding count data for selected informative regions from the full transformed count matrix

  • Quality control visualization: Generates optional diagnostic plots showing:

    • Mean-variance relationship with selected regions highlighted
    • Hypervariance distribution across expression levels with selection boundaries
    • Visual confirmation of balanced sampling strategy
  • Efficient processing: Uses vectorized operations and selective column loading for fast processing of large genomic datasets

Parameters

Parameter Type Required Description Example
transformed_cm_path character Yes Path to transformed count matrix file (.feather format with “pos” column as region identifier) transformed_cm_path = "normalized_counts.feather"
mav_stats_path character Yes Path to mean-variance statistics file (.feather format) generated from upstream MAV screening analysis mav_stats_path = "mav_screen/data_mav_screen.feather"
out_dir character No (default: “./”) Output directory path for saving filtered results and optional plots out_dir = "./informative_results"
n_bins integer No (default: 100) Number of bins to divide log2(mean) range for stratified sampling, ensuring representation across expression levels n_bins = 50
keep_percent numeric No (default: 0.01) Fraction of total regions to retain (0-1). Selected regions are distributed equally across all bins keep_percent = 0.05
log2mean_quantile_thres numeric No (default: 0.99) Percentile threshold (0-1) for log2(mean) expression. Only regions above this threshold are retained in final selection log2mean_quantile_thres = 0.95
plot logical No (default: FALSE) Whether to generate diagnostic scatter plots visualizing the selection process plot = TRUE
  • Input file requirements:

    • mav_stats_path: Must be the output .feather file from mav_screen() containing columns: pos, log2_mean, log2_var, and hypervar. The function will automatically select only these required columns during loading for memory efficiency.

    • transformed_cm_path: Must be a pre-processed count matrix in .feather format. This should be the same normalized matrix used as input to mav_screen().

  • Feature filtering behavior:

    • Regions with hypervar ≤ 1 are automatically excluded as they represent features with expected or reduced variance (not informative for downstream analysis)
    • The stratified binning ensures balanced selection across expression levels, preventing bias toward highly expressed features
    • The final log2mean_quantile_thres threshold removes lowly expressed regions that may be dominated by technical noise
  • Parameter tuning recommendations: If your analysis yields too few selected regions, this typically occurs when you’re analyzing a targeted genomic subset (e.g., promoter regions, CpG islands, specific gene loci) rather than genome-wide bins. The combination of default keep_percent = 0.01 (1%) and log2mean_quantile_thres = 0.99 (99th percentile) is too stringen.

    Scenario keep_percent log2mean_quantile_thres Expected output
    Genome-wide bins (default) 0.01 (1%) 0.99 (99th) ~1,000-5,000 regions
    Targeted regions (promoters, peaks) 0.05-0.10 (5-10%) 0.90-0.95 (90-95th) ~2,000-10,000 regions
    Very sparse data 0.10-0.20 (10-20%) 0.85-0.90 (85-90th) ~5,000-20,000 regions
    Large datasets (>100K input regions) Reduce n_bins to 50 Keep defaults Faster computation

Output Files

The function generates the following output files in the specified out_dir (default: “./”):

  1. Filtered Count Matrix - <input_name>_filtered_regions.feather
    • Feather format file containing count data for filtered regions
    • First column “pos” contains region identifiers
    • Subsequent columns contain count values for each sample
    • Only includes regions passing all selection criteria:
      • Hypervariance > 1 (overdispersed)
      • Top hypervariant features within each log2(mean) bin
      • log2(mean) above the specified quantile threshold
    • Dimensions: n_selected_regions × (n_samples + 1)
pos H3K36me3-H3K36me3 H3K36me3-H3K4me1 H3K27ac-H3K36me3 H3K36me3-H3K9me2
chr1_9601_10400 0.03135214 0.02224862 0.03659935 0.03001337
chr1_10401_11200 0.02891244 0.01988311 0.03451229 0.02833411
chr1_11201_12000 0.02750192 0.01844290 0.03329854 0.02701144
chr1_12001_12800 0.02933155 0.02015533 0.03510177 0.02985421
chr1_12801_13600 0.02684492 0.01791241 0.03166388 0.02622109
chr1_13601_14400 0.02599817 0.01762288 0.03088455 0.02548933
chr1_14401_15200 0.02744561 0.01819877 0.03255291 0.02680714
chr1_15201_16000 0.02819945 0.01888922 0.03394107 0.02790365
  1. Diagnostic Plots - <input_name>_filtered_regions.png (if plot = TRUE)

This combined figure contains two complementary visualizations of the selection process:

  • Panel A: Mean-Variance Relationship
    • Scatter plot of log2(mean) vs log2(variance)
    • Shows how selected regions relate to overall mean-variance distribution
    • Highlights selected regions:
      • Light Green points: All regions
      • Light Orange points: Final selected highly-variable regions
  • Panel B: Hypervariance vs Expression
    • Scatter plot of log2(mean) vs hypervariance
    • Visualizes the selection process across expression levels
    • Three layers of points:
      • Blue points: All regions from hypervar summary
      • Red points: Regions selected from binned filtering
      • Green points: Final selected highly-variable regions after quantile filtering

Example Usage

library(epigenomeR)
mav_path <- "./count_matrix/Count_Matrix_800_merged_transformed_mav_screen.feather"
path <- "./count_matrix/Count_Matrix_800_merged_transformed.feather"
informative_regions(transformed_cm_pat = path, mav_stats_path = mav_path, out_dir = "./count_matrix/", plot=TRUE)

2. Combinatory Landscape of Co-localized CRF Pairs

2A. Automated Clustering and Heatmap Generation

The biclustering() function provides an all-in-one solution for k-means clustering and heatmap visualization. It automatically performs clustering, generates cluster assignment files, and creates publication-ready heatmaps in a single workflow.

What this function does:

  • Performs k-means clustering on both rows (genomic features) and columns (samples)

  • Uses consensus clustering for robust cluster assignments

  • Hierarchically organizes clusters for optimal visualization

  • Automatically generates and saves cluster assignment tables

  • Creates publication-ready heatmaps with customizable aesthetics

  • Handles single or multiple count matrices with optional overlap analysis

  • Provides reproducible results with seed control

  • Eliminates manual workflow steps by integrating clustering and visualization

Parameters

Parameter Type Required Description Example
cm_path character Yes Path to count matrix .feather file with ‘pos’ column (rows=genomic positions, cols=samples) cm_path = "normalized_counts.feather"
row_km integer Yes Number of k-means clusters for rows (genomic features/regions) row_km = 15
col_km integer Yes Number of k-means clusters for columns (samples/conditions) col_km = 16
out_dir character Yes Directory to save cluster assignment files and heatmap output out_dir = "./clustering_results"
seed integer No (default: 123) Random seed for reproducible clustering results seed = 42
plot logical No (default: TRUE) Whether to generate and save heatmap visualization plot = FALSE
show_column_names logical No (default: FALSE) Whether to display column names at the bottom of the heatmap show_column_names = TRUE
lower_range numeric No (default: NULL) Lower bound for heatmap color scale. If NULL, uses minimum value from data lower_range = 0
upper_range numeric No (default: NULL) Upper bound for heatmap color scale. If NULL, uses maximum value from data upper_range = 10
row_title_fontsize numeric No (default: NULL) Font size for row cluster titles (A, B, C, etc.) row_title_fontsize = 40
col_title_fontsize numeric No (default: NULL) Font size for column cluster titles (1, 2, 3, etc.) col_title_fontsize = 22
legend_title_fontsize numeric No (default: NULL) Font size for legend title legend_title_fontsize = 40
legend_label_fontsize numeric No (default: NULL) Font size for legend tick labels legend_label_fontsize = 30

Output Files

The function generates the following output files in the specified out_dir:

  1. Row cluster assignments - row_table.tsv
    • Tab-separated file with row cluster assignments
    • Two columns:
      • feature: Genomic region identifier (e.g., “chr1_1000_2000”)
      • label: Cluster label as letter (A, B, C, D, etc.)
    • Sorted by cluster label for easy inspection
    • Rows within each cluster are ordered by hierarchical clustering
    • Used for annotating genomic regions by cluster membership
      A tibble: 20 × 2
      feature label
      <chr> <chr>
      1 chr16_89995201_89996000 A
      2 chr3_10401_11200 A
      3 chrX_156029601_156030400 A
      4 chr21_8234401_8235200 A
      5 chr7_64849601_64850400 A
      6 chr2_85993601_85994400 A
      7 chr7_106568801_106569600 B
      8 chr19_4374401_4375200 B
      9 chr10_72320801_72321600 B
      10 chr1_30768801_30769600 B
      11 chr1_28259201_28260000 C
      12 chr16_1379201_1380000 C
      13 chr6_43771201_43772000 C
      14 chr16_28824001_28824800 C
      15 chr11_65497601_65498400 C
      16 chr6_31958401_31959200 D
      17 chr18_3448001_3448800 D
      18 chr19_1044801_1045600 D
      19 chr17_75046401_75047200 D
      20 chr2_85602401_85603200 D
  2. Column cluster assignments - col_table.tsv
    • Tab-separated file with column (sample) cluster assignments
    • Two columns:
      • feature: Sample identifier (e.g., “YY1-cJun”)
      • label: Cluster label as number (1, 2, 3, etc.)
    • Sorted by cluster label
    • Columns within each cluster are ordered by hierarchical clustering
    • Used for grouping samples by similarity
      A tibble: 12 × 2
      feature label
      <chr> <dbl>
      1 H3K9me3-H3K9me3 1
      2 H3K27ac-H3K4me3 2
      3 H3K4me1-H3K9me3 2
      4 H3K9me2-H3K9me3 2
      5 H3K27ac-H3K4me1 2
      6 H3K27ac-H3K27me3 2
      7 H3K27ac-H3K9me3 2
      8 H3K27ac-H3K27ac 2
      9 H3K4me1-H3K4me3 3
      10 H3K27me3-H3K4me3 3
      11 H3K4me3-H3K9me3 3
      12 H3K4me3-H3K4me3 4
  3. Clustered heatmap - figures/<basename>_figS3A.pdf
    • Publication-ready heatmap in PDF format
    • Features:
      • Rows organized by k-means clusters (labeled A, B, C, etc.)
      • Columns organized by k-means clusters (labeled 1, 2, 3, etc.)
      • Color scale: Blue (low) → White (middle) → Red (high)
      • Horizontal legend positioned at bottom
      • White gaps between clusters for visual separation (3mm)
      • Customizable font sizes for titles and labels
      • Transparent background for publication flexibility
    • Dimensions automatically calculated based on matrix size (5mm per feature/sample)

Example Usage

library(epigenomeR)
path <- "/dcs05/hongkai/data/next_cutntag/bulk/wgc/mixed/800/V1V2_mixed_800_colQC-all-qc_libnorm_noAllZero_log2_qnorm_gam-40_per-0.01_mean-0.99.feather"
biclustering(cm_path = path, row_km = 15, col_km = 16, out_dir = "/users/yhu1/next_cutntag/mav_screen/figures/test_plots")

Notes

  • Input requirements:

    • Count matrix must be in .feather format
    • First column must be named “pos” containing feature identifiers
    • Matrix must contain numeric values only
  • Reproducibility: Always set seed parameter for reproducible results, as k-means involves random initialization.

  • Color scale interpretation:

    • Default: Auto-scales to data range
    • Custom: Use lower_range and upper_range for consistent scales across analyses
    • Blue = Low signal, White = Medium, Red = High signal

2B. (Optional) Add Non-Informative Regions Back to Clusters

The add_regions_back_to_cluster() function assigns cluster labels to genomic regions that were excluded from the informative set by correlating them with existing cluster signatures derived from informative regions.

What this function does:

  • Identifies regions excluded from informative clustering analysis

  • Filters non-informative regions based on non-zero count threshold

  • Computes cluster signatures by aggregating informative regions within each cluster

  • Calculates correlations between non-informative regions and cluster signatures

  • Assigns cluster labels based on maximum correlation values

  • Applies quantile-based filtering to retain high-confidence assignments

  • Generates comprehensive feature-label table with priority-based classification

  • Provides optional diagnostic histogram of correlation distributions

Parameters

Parameter Type Required Description Example
nozero_file_path character Yes Path to feather file containing all non-zero regions with ‘pos’ column as first column nozero_file_path = "all_nonzero_regions.feather"
transformed_file_path character Yes Path to feather file with transformed count data (e.g., normalized, log-transformed) for correlation analysis transformed_file_path = "qnorm_counts.feather"
informative_path character Yes Path to feather file containing informative/significant regions used for original clustering informative_path = "informative_regions.feather"
cluster_path character Yes Path to TSV/TXT file with ‘feature’ and ‘label’ columns defining cluster assignments for informative regions cluster_path = "row_clusters.tsv"
out_path character Yes File path to save final feature-label table (supports .feather, .tsv, .csv formats) out_path = "./results/all_regions_labeled.feather"
save_plot logical No (default: FALSE) Whether to generate and save correlation distribution histogram save_plot = TRUE
plot_path character No (default: NULL) File path for saving histogram plot (PNG format). Required if save_plot = TRUE plot_path = "./plots/correlation_dist.png"
cutoff_non_zero integer No (default: 10) Minimum number of non-zero samples required per region. Regions with MORE than this many non-zero values are kept cutoff_non_zero = 15
quantile_threshold numeric No (default: 0.75) Quantile threshold (0-1) for filtering high-correlation regions. Only regions above this quantile are assigned cluster labels quantile_threshold = 0.80

Output Files

The function generates the following output files:

  1. Feature-label assignment table - <out_path>
    • Comprehensive table containing all genomic regions with assigned labels
    • Two columns:
      • feature: Genomic region identifier (chromosome coordinates)
      • label: Assigned cluster label or category
    • Label priority hierarchy (highest to lowest):
      • Cluster assignments: From cluster_path (original informative region clusters)
      • Correlation-based assignments: High-correlation regions matched to cluster signatures
      • CRF_specific: Regions passing non-zero filter but below correlation threshold
      • Background: All other regions with non-zero counts
    • File format determined by extension (.feather, .tsv, or .csv)
    • Includes all non-zero regions from the genome
    feature label
    chr1_9601_10400 J
    chr1_10401_11200 CRF_specific
    chr1_12801_13600 Background
    chr1_14401_15200 Background
    chr1_15201_16000 Background
    chr1_16001_16800 Background
  2. Correlation distribution histogram - <plot_path> (if save_plot = TRUE)
    • PNG format visualization showing distribution of maximum correlation values
    • X-axis: Maximum correlation between regions and cluster signatures
    • Y-axis: Frequency of regions
    • Red dashed line: Quantile threshold for filtering
    • Legend: Shows quantile threshold value
    • Resolution: 800 × 600 pixels
    • Helps visualize correlation quality and threshold selection
    • Color scheme: Light blue bars with black borders

Example Usage

library(epigenomeR)
nozero_file_path <- "./V1V2_mixed_800_colQC-all-qc_libnorm_noAllZero.feather"
transformed_file_path <- "./V1V2_mixed_800_colQC-all-qc_libnorm_noAllZero_log2_qnorm.feather"
informative_path <- "./V1V2_mixed_800_colQC-all-qc_libnorm_noAllZero_log2_qnorm_gam-40_per-0.01_mean-0.99.feather"
plot_path <- "output/AAA_hist_for_add_regions.png"
cluster_path <- "./bicluster_V_mixed_all-qc_kmeans_euclidean_row_num-15_column_num-16_heatmap_row_clusters_manually_reorganized.tsv"
out_path <- "output/AAA_add_regions_to_cluster_result.tsv"

result <- add_regions_back_to_cluster(nozero_file_path, transformed_file_path, informative_path, cluster_path, out_path, save_plot = TRUE, plot_path = plot_path, cutoff_non_zero = 10, quantile_threshold = 0.75)
head(result)
dim(result)
nrow(result)

Notes

  • This function must be run after clustering analysis of informative regions is complete.

  • Input file requirements:

    • All feather files must have ‘pos’ column as first column
    • cluster_path TSV must have ‘feature’ and ‘label’ columns
    • Feature names must be consistent across all input files
  • Correlation calculation: Uses Pearson correlation with complete observations (pairwise deletion of NA values).

  • Non-zero filtering: The cutoff_non_zero = 10 means regions must have MORE than 10 non-zero samples (not equal to 10).

  • Quantile threshold interpretation: quantile_threshold = 0.75 means only regions with correlation in the top 25% receive cluster assignments.

  • Output format flexibility: Supports .feather (fast, compact), .tsv (human-readable), or .csv (universal compatibility).


2C. Apply Pre-Computed Clusters to Generate Heatmaps

After add non-informative regions back to cluster, you can use apply_cluster_heatmap() function to redraw the heatmap. The apply_cluster_heatmap() function is a wrapper that applies pre-computed cluster assignments to one or multiple count matrices to generate publication-ready heatmaps. It handles data loading, validation, and optional comparative analysis across matrices.

What this function does:

  • Applies existing cluster assignments to count matrices without re-clustering

  • Validates input files and data integrity

  • Handles single or multiple count matrices with flexible processing modes

  • Supports comparative analysis by identifying shared features across matrices

  • Orders features and samples according to cluster assignments

  • Generates publication-ready heatmaps with customizable aesthetics

  • Provides consistent color scaling across multiple matrices

  • Creates organized output directory structure

Parameters

Parameter Type Required Description Example
count_matrix_file_path character vector Yes Vector of file paths to count matrix .feather files. Each file must have ‘pos’ column (rows=genomic positions, cols=samples) count_matrix_file_path = c("matrix1.feather", "matrix2.feather")
row_cluster_file_path character Yes Path to row cluster assignment .tsv file with ‘feature’ and ‘label’ columns containing position-to-cluster mappings row_cluster_file_path = "row_clusters.tsv"
col_cluster_file_path character Yes Path to column cluster assignment .tsv file with ‘feature’ and ‘label’ columns containing sample-to-cluster mappings col_cluster_file_path = "col_clusters.tsv"
out_dir character No (default: NULL) Directory to save heatmap output. If NULL, uses input file’s directory out_dir = "./heatmaps"
seed integer No (default: 123) Random seed for reproducible results (affects any stochastic elements) seed = 42
show_dend_boolean logical No (default: FALSE) Whether to show column names in heatmap show_dend_boolean = TRUE
count_matrix_overlap logical No (default: FALSE) Whether to analyze only shared positions and samples across matrices for comparative analysis count_matrix_overlap = TRUE
lower_range numeric No (default: NULL) Lower bound for heatmap color scale. If NULL, auto-determined from data lower_range = 0
upper_range numeric No (default: NULL) Upper bound for heatmap color scale. If NULL, auto-determined from data upper_range = 10
row_title_fontsize numeric No (default: NULL) Font size for row cluster titles (A, B, C, etc.). If NULL, uses default row_title_fontsize = 40
col_title_fontsize numeric No (default: NULL) Font size for column cluster titles (1, 2, 3, etc.). If NULL, uses default col_title_fontsize = 22
legend_title_fontsize numeric No (default: NULL) Font size for legend title. If NULL, uses default legend_title_fontsize = 40
legend_label_fontsize numeric No (default: NULL) Font size for legend tick labels. If NULL, uses default legend_label_fontsize = 30

Output Files

The function generates the following output files in the specified out_dir:

  1. Clustered heatmap(s) - figures/<basename>_figS3A.pdf
    • Publication-ready heatmap(s) in PDF format
    • One PDF file per input count matrix
    • Features:
      • Rows ordered and split by cluster assignments from row_cluster_file_path
      • Columns ordered and split by cluster assignments from col_cluster_file_path
      • Row clusters labeled with letters (A, B, C, etc.)
      • Column clusters labeled with numbers (1, 2, 3, etc.)
      • Color scale: Blue (#3155C3) → White → Red (#AF0525)
      • Horizontal legend positioned at bottom with title “Normalized Read Counts”
      • White gaps between clusters (3mm) for visual separation
      • Transparent background for publication flexibility
      • No dendrograms (unless show_dend_boolean = TRUE shows column names)
    • Dimensions automatically calculated based on matrix size (5mm per feature/sample)
    • File naming: <input_matrix_basename>_figS3A.pdf

Example Usage

library(epigenomeR)
result <- add_regions_back_to_cluster(nozero_file_path, transformed_file_path, informative_path, cluster_path, out_path, save_plot = TRUE, plot_path = plot_path, cutoff_non_zero = 10, quantile_threshold = 0.75)
head(result)
dim(result)
nrow(result)

# save the new row cluster
filtered_result <- result[!result$label %in% c("Background", "CRF_specific"), ]
row_output_path <- "output/AAA_row_cluster_for_heatmap.tsv"
write.table(filtered_result, row_output_path, sep = "\t", row.names = FALSE, quote = FALSE)

# rearrange the order
informative_regions <- read_feather(informative_path)
clusters <- rowcol_km_like_ComplexHeatmap(mat = informative_regions, row_k = 15, col_k = 16) 
cluster_group_col <- data.frame(clusters$col_num)
colnames(cluster_group_col) <- "label"
cluster_group_col <- cluster_group_col %>% tibble::rownames_to_column("feature")
col_output_path <- "output/AAA_col_cluster_for_heatmap.tsv"
write.table(cluster_group_col, col_output_path, sep = "\t", row.names = FALSE, quote = FALSE)

apply_cluster_heatmap(count_matrix_file_path = transformed_file_path, row_cluster_file_path = path1, col_cluster_file_path = path2, out_dir = "output/AAA_add_regions_heatmap", lower_range = 0, upper_range = 4.5)

3. Genomic Ranges’ Annotation

3A. Annotate and Plot Regulatory Element Composition for Clustered Regions

The annotation_ccre_hmm() function analyzes the genomic annotation composition of clustered regions by overlapping with CCRE (Candidate Cis-Regulatory Elements) and ChromHMM chromatin state annotations, then generates comparative bar plots showing regulatory element distribution across clusters.

What this function does:

  • Parses genomic coordinates from cluster assignment files

  • Converts position strings to GenomicRanges objects for overlap analysis

  • Overlaps clustered regions with CCRE annotations (cell-type agnostic)

  • Overlaps clustered regions with ChromHMM chromatin state annotations

  • Calculates percentage composition of regulatory elements per cluster

  • Generates publication-ready stacked bar plots for visual comparison

  • Excludes background and epitope-specific regions from analysis

  • Provides quantitative characterization of cluster regulatory landscapes

Parameters

Parameter Type Required Description Example
row_cluster_file_path character Yes Path to row cluster assignment .tsv file with ‘feature’ column (genomic positions as “chr_start_end”) and ‘label’ column (cluster assignments) row_cluster_file_path = "row_clusters.tsv"
out_dir character Yes Directory to save annotation plot outputs (PDF format) out_dir = "./annotation_plots"

Output Files

The function generates the following output files in the specified out_dir:

  1. CCRE composition bar plot - p_ccre_agnostic.pdf
    • Stacked horizontal bar plot showing CCRE annotation distribution across clusters
    • Each bar represents one cluster (labeled A, B, C, etc. or 1, 2, 3, etc.)
    • Bar segments colored by CCRE category:
      • Promoter-like signatures (pELS)
      • Enhancer-like signatures (dELS)
      • CTCF-only elements
      • DNase-H3K4me3 elements
      • Other regulatory categories
    • X-axis: Percentage (0-100%)
    • Y-axis: Cluster labels (reversed order, top to bottom)
    • Legend on right side with category labels
    • Dimensions: 6 × 5 inches
    • Title: “Cis-regulatory Elements (cell type agnostic)”
    • Cell-type agnostic annotations provide broad regulatory classification
  2. ChromHMM composition bar plot - p_chromhmm_short.pdf
    • Stacked horizontal bar plot showing ChromHMM chromatin state distribution across clusters
    • Each bar represents one cluster
    • Bar segments colored by ChromHMM state:
      • Active TSS (transcription start sites)
      • Flanking TSS
      • Strong enhancers
      • Weak enhancers
      • Transcribed regions
      • Heterochromatin
      • Quiescent/Low signal
      • Other chromatin states (15-state model)
    • X-axis: Percentage (0-100%)
    • Y-axis: Cluster labels (reversed order, top to bottom)
    • Legend on right side with state labels
    • Dimensions: 6 × 5 inches
    • Title: “ChromHMM Elements”
    • Provides functional chromatin state annotation

Example Usage

library(epigenomeR)
row_cluster_file_path <- "./bicluster_row_cluster.tsv"
out_dir <- "output/figures"

annotation_ccre_hmm(row_cluster_file_path = row_cluster_file_path, out_dir = out_dir)

3B. Biclustering TFBS Annotation Pipeline

The biclustering_annotation_TFBS() function provides a complete, automated workflow for analyzing transcription factor binding site (TFBS) enrichment across multiple clusters of genomic regions. It handles the entire pipeline from control region generation to enrichment testing and visualization.

What this function does:

  • Reads a file containing clustered genomic regions (e.g., from biclustering analysis)

  • Generates matched control regions for each cluster using gene-based matching

  • Performs TFBS enrichment analysis comparing each cluster against controls

  • Creates heatmap visualizations showing enrichment patterns across clusters

  • Saves all intermediate and final results to organized output files

Parameters

Parameter Type Required Description Example
row_cluster_file_path character Yes Path to tab-delimited file with clustered regions. Must have columns: feature (format: chr_start_end) and label (cluster ID) "bicluster_results.tsv"
out_dir character No (default: “./”) Output directory where all results will be saved "./TFBS_results/"
ref_genome character No (default: “hg38”) Reference genome version. Options: “hg38”, “hg19”, “mm10”, “mm39” ref_genome = "mm10"
control_rep integer No (default: 1) Multiplier for control region generation. Determines the ratio of control to target regions. E.g., control_rep = 2 generates 2x control regions; control_rep = 5 generates 5x control regions control_rep = 3
regions integer No (default: 800) Size in base pairs to resize all regions (centered on original midpoint) regions = 500
plot logical No (default: TRUE) Whether to generate heatmap visualizations. If FALSE, only performs enrichment analysis plot = FALSE

Input File Format:

The cluster file must be tab-delimited with at least these columns:

feature              label
chr1_1234567_1235567 cluster_1
chr2_9876543_9877543 cluster_1
chr3_5555555_5556555 cluster_2
...
  • feature: Genomic coordinates in format “chr_start_end” (underscore-separated)
  • label: Cluster assignment (e.g., “cluster_1”, “group_A”, “bicluster_2”)

Output Files

All output files are saved to the specified out_dir:

File Description
all_controls.bed BED file containing all matched control regions (combined and de-duplicated)
TFBS_enrichment_cluster_<label>.tsv One TSV per cluster with enrichment statistics (odds ratio, p-value, FDR, hit counts)
TFBS_heatmap_all.pdf Heatmap visualization of enriched motifs across all clusters (if plot = TRUE)
odds_ratio_log2.csv Matrix of log2 odds ratios for all TFs × clusters (if plot = TRUE)
FDR.csv Matrix of FDR values for all TFs × clusters (if plot = TRUE)

Usage Examples

library(epigenomeR)

# Basic usage - complete pipeline with visualization
biclustering_annotation_TFBS(
  row_cluster_file_path = "NMF_clusters.tsv",
  out_dir = "./TFBS_analysis/",
  ref_genome = "hg38"
)

# Custom region size for enhancer analysis
biclustering_annotation_TFBS(
  row_cluster_file_path = "enhancer_clusters.tsv",
  out_dir = "./enhancer_TFBS/",
  ref_genome = "hg38",
  regions = 1000
)

# Mouse genome analysis
biclustering_annotation_TFBS(
  row_cluster_file_path = "mouse_peaks_clustered.tsv",
  out_dir = "./mouse_TFBS/",
  ref_genome = "mm10",
  regions = 500
)

# Generate 3x more control regions for more robust statistics
biclustering_annotation_TFBS(
  row_cluster_file_path = "ATAC_peaks_clustered.tsv",
  out_dir = "./ATAC_TFBS/",
  ref_genome = "hg38",
  control_rep = 3,
  regions = 800
)

# Enrichment analysis only (no heatmap)
biclustering_annotation_TFBS(
  row_cluster_file_path = "clusters.tsv",
  out_dir = "./TFBS_tables/",
  ref_genome = "hg38",
  plot = FALSE
)