This workflow provides two complementary approaches for identifying highly variable genomic regions:
Integrated workflow (recommended): The
detect_hvr() function combines mean-variance modeling and
feature selection in a single step, internally calling
mav_screen() and filter_hvr() in sequence.
library(epigenomeR)
path <- "./count_matrix/Count_Matrix_800_merged_transformed.feather"
detect_hvr(transformed_cm_path = path, out_dir = "./count_matrix/")Step-by-step workflow: For users who need
fine-grained control or want to inspect intermediate results, the
individual functions mav_screen() and
filter_hvr() can be called separately.
Function Overview:
detect_hvr() - Complete pipeline
for detecting highly variable regions (wrapper function)
mav_screen() - Step 1: Model
mean-variance relationship and calculate hypervariance metrics
filter_hvr() - Step 2: Select
top-ranked regions based on hypervariance values
The mav_screen() function models the mean-variance
relationship in transformed count data and normalizes features by their
expected variance. It computes hypervariance metrics that quantify each
feature’s deviation from the expected mean-variance trend.
What this function does:
This function identifies overdispersed genomic features by modeling the mean-variance relationship using GAM or LOESS regression.
The function fits a regression model to characterize the relationship between feature mean and variance in log2 space across all genomic regions. From this fitted model, it calculates the expected variance for each feature based on its mean expression level. Each feature is then normalized by centering on its observed mean and scaling by the expected standard deviation derived from the model fit.
The function computes hypervariance metrics (sum of squared normalized values divided by degrees of freedom) to quantify the degree of overdispersion for each feature. Features with higher hypervariance values exhibit greater variability than expected from the mean-variance trend, indicating biological overdispersion. All statistics including observed mean/variance, expected variance, normalized statistics, and hypervariance metrics are saved to a feather file.
| Parameter | Type | Required | Description | Example |
|---|---|---|---|---|
transformed_cm_path |
character | Yes | Path to transformed count matrix file (.feather format with “pos” column) | transformed_cm_path = "data_transformed.feather" |
out_dir |
character | No (default: “./”) | Output directory path for results | out_dir = "./mav_screen_results" |
fitting_model |
character | No (default: “gam”) | Model to fit mean-variance relationship: “gam” or “loess” | fitting_model = "loess" |
k |
integer | No (default: 40) | Number of basis functions for GAM spline fitting (only used when fitting_model = “gam”) | k = 30 |
span |
numeric | No (default: 0.5) | Smoothing span for LOESS fitting (only used when fitting_model = “loess”) | span = 0.3 |
seed |
integer | No (default: 42) | Random seed for reproducibility in sampling | seed = 123 |
font_size |
numeric | No (default: 10) | Font size for plot labels and axes | font_size = 12 |
nrow_sample_per |
numeric | No (default: 0.2) | Proportion of rows (0-1) to randomly sample for density plot | nrow_sample_per = 0.1 |
plot |
logical | No (default: FALSE) | Whether to generate and save diagnostic plots (.png files) | plot = TRUE |
Input data requirements:
The transformed_cm_path parameter refers directly to a
pre-processed count matrix stored as a .feather file. This
file must not contain raw counts. Instead, the matrix should already
be:
Additionally, the .feather file must have the first
column named “pos” containing feature identifiers (e.g.,
chr1_9601_10400).
Model selection:
Memory considerations: For very large datasets,
adjust nrow_sample_per to reduce memory usage for density
plots.
The function generates the following output files in the specified
out_dir (default: “./”):
<input_name>_mav_screen.feather
pos: Original region identifiermean: Mean expression across all samplesvar: Variance across all samplesvar_expect: Expected variance from fitted mean-variance
modelnorm_mean: Mean of normalized values (should be
~0)norm_var: Variance of normalized values (should be
~1)hypervar: Hypervariance, sum of squared normalized
values divided by (n-1)log2_mean: Log2-transformed mean expressionlog2_var: Log2-transformed variancelog2_var_expected: Log2-transformed expected variance
from fitted model| pos | mean | var | var_expect | norm_mean | norm_var | hypervar |
|---|---|---|---|---|---|---|
| chr1_9601_10400 | 0.60482747 | 3.2718638368 | 2.999402e+00 | 2.999402e+00 | -2.400220e-17 | 1.0908388 |
| chr1_10401_11200 | 0.06557750 | 0.1646858526 | 1.620011e-01 | 1.620011e-01 | -6.149439e-18 | 1.0165723 |
| chr1_12801_13600 | 0.03007577 | 0.0003917656 | 6.413126e-04 | 6.413126e-04 | -1.840330e-18 | 0.6108808 |
| chr1_14401_15200 | 0.02905376 | 0.0001423452 | 1.494525e-04 | 1.494525e-04 | 9.551020e-17 | 0.9524448 |
| chr1_15201_16000 | 0.02994487 | 0.0007778492 | 5.396820e-04 | 5.396820e-04 | -4.221334e-17 | 1.4413104 |
| chr1_16001_16800 | 0.02875908 | 0.0001076822 | 9.580701e-05 | 9.580701e-05 | -1.361187e-16 | 1.1239494 |
| log2_mean | log2_var | log2_var_expected |
|---|---|---|
| -0.7254044 | 1.710113 | 1.585175 |
| -3.9306552 | -2.602211 | -2.625862 |
| -5.0552546 | -11.317722 | -10.603762 |
| -5.1051311 | -12.778318 | -12.731808 |
| -5.0615474 | -10.328222 | -10.873420 |
| -5.1198388 | -13.180932 | -13.352449 |
<input_name>_mean_variance.png (if
plot = TRUE)

<input_name>_fit_density.png (if
plot = TRUE)
nrow_sample_per and seed)
library(epigenomeR)
path <- "./count_matrix/Count_Matrix_800_merged_transformed.feather"
mav_screen(transformed_cm_path = path, out_dir = "./count_matrix/", plot=TRUE)
The filter_hvr() function identifies the most
informative genomic regions by combining hypervariance-based selection
with stratified sampling across expression levels.
What this function does:
Stratified feature selection: Divides the log2(mean) expression range into equal-sized bins to ensure balanced representation across low, medium, and high expression levels
Hypervariance filtering: Selects top hypervariant regions from each bin, prioritizing regions with biological variability (hypervariance > 1 indicates overdispersion)
Expression threshold: Applies a percentile-based cutoff to retain only highly expressed regions, removing lowly expressed features that may be noisy
Count matrix extraction: Retrieves corresponding count data for selected informative regions from the full transformed count matrix
Quality control visualization: Generates optional diagnostic plots showing:
Efficient processing: Uses vectorized operations and selective column loading for fast processing of large genomic datasets
| Parameter | Type | Required | Description | Example |
|---|---|---|---|---|
transformed_cm_path |
character | Yes | Path to transformed count matrix file (.feather format with “pos” column as region identifier) | transformed_cm_path = "normalized_counts.feather" |
mav_stats_path |
character | Yes | Path to mean-variance statistics file (.feather format) generated from upstream MAV screening analysis | mav_stats_path = "mav_screen/data_mav_screen.feather" |
out_dir |
character | No (default: “./”) | Output directory path for saving filtered results and optional plots | out_dir = "./informative_results" |
n_bins |
integer | No (default: 100) | Number of bins to divide log2(mean) range for stratified sampling, ensuring representation across expression levels | n_bins = 50 |
keep_percent |
numeric | No (default: 0.01) | Fraction of total regions to retain (0-1). Selected regions are distributed equally across all bins | keep_percent = 0.05 |
log2mean_quantile_thres |
numeric | No (default: 0.99) | Percentile threshold (0-1) for log2(mean) expression. Only regions above this threshold are retained in final selection | log2mean_quantile_thres = 0.95 |
plot |
logical | No (default: FALSE) | Whether to generate diagnostic scatter plots visualizing the selection process | plot = TRUE |
Input file requirements:
mav_stats_path: Must be the output
.feather file from mav_screen() containing
columns: pos, log2_mean,
log2_var, and hypervar. The function will
automatically select only these required columns during loading for
memory efficiency.
transformed_cm_path: Must be a
pre-processed count matrix in .feather format. This should
be the same normalized matrix used as input to
mav_screen().
Feature filtering behavior:
hypervar ≤ 1 are automatically excluded as
they represent features with expected or reduced variance (not
informative for downstream analysis)log2mean_quantile_thres threshold removes
lowly expressed regions that may be dominated by technical noiseParameter tuning recommendations: If your
analysis yields too few selected regions, this typically occurs when
you’re analyzing a targeted genomic subset (e.g.,
promoter regions, CpG islands, specific gene loci) rather than
genome-wide bins. The combination of default
keep_percent = 0.01 (1%) and
log2mean_quantile_thres = 0.99 (99th percentile) is too
stringen.
| Scenario | keep_percent |
log2mean_quantile_thres |
Expected output |
|---|---|---|---|
| Genome-wide bins (default) | 0.01 (1%) | 0.99 (99th) | ~1,000-5,000 regions |
| Targeted regions (promoters, peaks) | 0.05-0.10 (5-10%) | 0.90-0.95 (90-95th) | ~2,000-10,000 regions |
| Very sparse data | 0.10-0.20 (10-20%) | 0.85-0.90 (85-90th) | ~5,000-20,000 regions |
| Large datasets (>100K input regions) | Reduce n_bins to 50 |
Keep defaults | Faster computation |
The function generates the following output files in the specified
out_dir (default: “./”):
<input_name>_filtered_regions.feather
| pos | H3K36me3-H3K36me3 | H3K36me3-H3K4me1 | H3K27ac-H3K36me3 | H3K36me3-H3K9me2 |
|---|---|---|---|---|
| chr1_9601_10400 | 0.03135214 | 0.02224862 | 0.03659935 | 0.03001337 |
| chr1_10401_11200 | 0.02891244 | 0.01988311 | 0.03451229 | 0.02833411 |
| chr1_11201_12000 | 0.02750192 | 0.01844290 | 0.03329854 | 0.02701144 |
| chr1_12001_12800 | 0.02933155 | 0.02015533 | 0.03510177 | 0.02985421 |
| chr1_12801_13600 | 0.02684492 | 0.01791241 | 0.03166388 | 0.02622109 |
| chr1_13601_14400 | 0.02599817 | 0.01762288 | 0.03088455 | 0.02548933 |
| chr1_14401_15200 | 0.02744561 | 0.01819877 | 0.03255291 | 0.02680714 |
| chr1_15201_16000 | 0.02819945 | 0.01888922 | 0.03394107 | 0.02790365 |
<input_name>_filtered_regions.png (if
plot = TRUE)This combined figure contains two complementary visualizations of the selection process:

library(epigenomeR)
mav_path <- "./count_matrix/Count_Matrix_800_merged_transformed_mav_screen.feather"
path <- "./count_matrix/Count_Matrix_800_merged_transformed.feather"
informative_regions(transformed_cm_pat = path, mav_stats_path = mav_path, out_dir = "./count_matrix/", plot=TRUE)
The biclustering() function provides an all-in-one
solution for k-means clustering and heatmap visualization. It
automatically performs clustering, generates cluster assignment files,
and creates publication-ready heatmaps in a single workflow.
What this function does:
Performs k-means clustering on both rows (genomic features) and columns (samples)
Uses consensus clustering for robust cluster assignments
Hierarchically organizes clusters for optimal visualization
Automatically generates and saves cluster assignment tables
Creates publication-ready heatmaps with customizable aesthetics
Handles single or multiple count matrices with optional overlap analysis
Provides reproducible results with seed control
Eliminates manual workflow steps by integrating clustering and visualization
| Parameter | Type | Required | Description | Example |
|---|---|---|---|---|
cm_path |
character | Yes | Path to count matrix .feather file with ‘pos’ column (rows=genomic positions, cols=samples) | cm_path = "normalized_counts.feather" |
row_km |
integer | Yes | Number of k-means clusters for rows (genomic features/regions) | row_km = 15 |
col_km |
integer | Yes | Number of k-means clusters for columns (samples/conditions) | col_km = 16 |
out_dir |
character | Yes | Directory to save cluster assignment files and heatmap output | out_dir = "./clustering_results" |
seed |
integer | No (default: 123) | Random seed for reproducible clustering results | seed = 42 |
plot |
logical | No (default: TRUE) | Whether to generate and save heatmap visualization | plot = FALSE |
show_column_names |
logical | No (default: FALSE) | Whether to display column names at the bottom of the heatmap | show_column_names = TRUE |
lower_range |
numeric | No (default: NULL) | Lower bound for heatmap color scale. If NULL, uses minimum value from data | lower_range = 0 |
upper_range |
numeric | No (default: NULL) | Upper bound for heatmap color scale. If NULL, uses maximum value from data | upper_range = 10 |
row_title_fontsize |
numeric | No (default: NULL) | Font size for row cluster titles (A, B, C, etc.) | row_title_fontsize = 40 |
col_title_fontsize |
numeric | No (default: NULL) | Font size for column cluster titles (1, 2, 3, etc.) | col_title_fontsize = 22 |
legend_title_fontsize |
numeric | No (default: NULL) | Font size for legend title | legend_title_fontsize = 40 |
legend_label_fontsize |
numeric | No (default: NULL) | Font size for legend tick labels | legend_label_fontsize = 30 |
The function generates the following output files in the specified
out_dir:
row_table.tsv
feature: Genomic region identifier (e.g.,
“chr1_1000_2000”)label: Cluster label as letter (A, B, C, D, etc.)| feature | label | |
|---|---|---|
| <chr> | <chr> | |
| 1 | chr16_89995201_89996000 | A |
| 2 | chr3_10401_11200 | A |
| 3 | chrX_156029601_156030400 | A |
| 4 | chr21_8234401_8235200 | A |
| 5 | chr7_64849601_64850400 | A |
| 6 | chr2_85993601_85994400 | A |
| 7 | chr7_106568801_106569600 | B |
| 8 | chr19_4374401_4375200 | B |
| 9 | chr10_72320801_72321600 | B |
| 10 | chr1_30768801_30769600 | B |
| 11 | chr1_28259201_28260000 | C |
| 12 | chr16_1379201_1380000 | C |
| 13 | chr6_43771201_43772000 | C |
| 14 | chr16_28824001_28824800 | C |
| 15 | chr11_65497601_65498400 | C |
| 16 | chr6_31958401_31959200 | D |
| 17 | chr18_3448001_3448800 | D |
| 18 | chr19_1044801_1045600 | D |
| 19 | chr17_75046401_75047200 | D |
| 20 | chr2_85602401_85603200 | D |
col_table.tsv
feature: Sample identifier (e.g., “YY1-cJun”)label: Cluster label as number (1, 2, 3, etc.)| feature | label | |
|---|---|---|
| <chr> | <dbl> | |
| 1 | H3K9me3-H3K9me3 | 1 |
| 2 | H3K27ac-H3K4me3 | 2 |
| 3 | H3K4me1-H3K9me3 | 2 |
| 4 | H3K9me2-H3K9me3 | 2 |
| 5 | H3K27ac-H3K4me1 | 2 |
| 6 | H3K27ac-H3K27me3 | 2 |
| 7 | H3K27ac-H3K9me3 | 2 |
| 8 | H3K27ac-H3K27ac | 2 |
| 9 | H3K4me1-H3K4me3 | 3 |
| 10 | H3K27me3-H3K4me3 | 3 |
| 11 | H3K4me3-H3K9me3 | 3 |
| 12 | H3K4me3-H3K4me3 | 4 |
figures/<basename>_figS3A.pdf

library(epigenomeR)
path <- "/dcs05/hongkai/data/next_cutntag/bulk/wgc/mixed/800/V1V2_mixed_800_colQC-all-qc_libnorm_noAllZero_log2_qnorm_gam-40_per-0.01_mean-0.99.feather"
biclustering(cm_path = path, row_km = 15, col_km = 16, out_dir = "/users/yhu1/next_cutntag/mav_screen/figures/test_plots")
Input requirements:
Reproducibility: Always set seed
parameter for reproducible results, as k-means involves random
initialization.
Color scale interpretation:
lower_range and upper_range
for consistent scales across analysesThe add_regions_back_to_cluster() function assigns
cluster labels to genomic regions that were excluded from the
informative set by correlating them with existing cluster signatures
derived from informative regions.
What this function does:
Identifies regions excluded from informative clustering analysis
Filters non-informative regions based on non-zero count threshold
Computes cluster signatures by aggregating informative regions within each cluster
Calculates correlations between non-informative regions and cluster signatures
Assigns cluster labels based on maximum correlation values
Applies quantile-based filtering to retain high-confidence assignments
Generates comprehensive feature-label table with priority-based classification
Provides optional diagnostic histogram of correlation distributions
| Parameter | Type | Required | Description | Example |
|---|---|---|---|---|
nozero_file_path |
character | Yes | Path to feather file containing all non-zero regions with ‘pos’ column as first column | nozero_file_path = "all_nonzero_regions.feather" |
transformed_file_path |
character | Yes | Path to feather file with transformed count data (e.g., normalized, log-transformed) for correlation analysis | transformed_file_path = "qnorm_counts.feather" |
informative_path |
character | Yes | Path to feather file containing informative/significant regions used for original clustering | informative_path = "informative_regions.feather" |
cluster_path |
character | Yes | Path to TSV/TXT file with ‘feature’ and ‘label’ columns defining cluster assignments for informative regions | cluster_path = "row_clusters.tsv" |
out_path |
character | Yes | File path to save final feature-label table (supports .feather, .tsv, .csv formats) | out_path = "./results/all_regions_labeled.feather" |
save_plot |
logical | No (default: FALSE) | Whether to generate and save correlation distribution histogram | save_plot = TRUE |
plot_path |
character | No (default: NULL) | File path for saving histogram plot (PNG format). Required if save_plot = TRUE | plot_path = "./plots/correlation_dist.png" |
cutoff_non_zero |
integer | No (default: 10) | Minimum number of non-zero samples required per region. Regions with MORE than this many non-zero values are kept | cutoff_non_zero = 15 |
quantile_threshold |
numeric | No (default: 0.75) | Quantile threshold (0-1) for filtering high-correlation regions. Only regions above this quantile are assigned cluster labels | quantile_threshold = 0.80 |
The function generates the following output files:
<out_path>
feature: Genomic region identifier (chromosome
coordinates)label: Assigned cluster label or categorycluster_path
(original informative region clusters)| feature | label |
|---|---|
| chr1_9601_10400 | J |
| chr1_10401_11200 | CRF_specific |
| chr1_12801_13600 | Background |
| chr1_14401_15200 | Background |
| chr1_15201_16000 | Background |
| chr1_16001_16800 | Background |
<plot_path> (if save_plot = TRUE)

library(epigenomeR)
nozero_file_path <- "./V1V2_mixed_800_colQC-all-qc_libnorm_noAllZero.feather"
transformed_file_path <- "./V1V2_mixed_800_colQC-all-qc_libnorm_noAllZero_log2_qnorm.feather"
informative_path <- "./V1V2_mixed_800_colQC-all-qc_libnorm_noAllZero_log2_qnorm_gam-40_per-0.01_mean-0.99.feather"
plot_path <- "output/AAA_hist_for_add_regions.png"
cluster_path <- "./bicluster_V_mixed_all-qc_kmeans_euclidean_row_num-15_column_num-16_heatmap_row_clusters_manually_reorganized.tsv"
out_path <- "output/AAA_add_regions_to_cluster_result.tsv"
result <- add_regions_back_to_cluster(nozero_file_path, transformed_file_path, informative_path, cluster_path, out_path, save_plot = TRUE, plot_path = plot_path, cutoff_non_zero = 10, quantile_threshold = 0.75)
head(result)
dim(result)
nrow(result)
This function must be run after clustering analysis of informative regions is complete.
Input file requirements:
cluster_path TSV must have ‘feature’ and ‘label’
columnsCorrelation calculation: Uses Pearson correlation with complete observations (pairwise deletion of NA values).
Non-zero filtering: The
cutoff_non_zero = 10 means regions must have MORE than 10
non-zero samples (not equal to 10).
Quantile threshold interpretation:
quantile_threshold = 0.75 means only regions with
correlation in the top 25% receive cluster assignments.
Output format flexibility: Supports .feather (fast, compact), .tsv (human-readable), or .csv (universal compatibility).
After add non-informative regions back to cluster, you can use
apply_cluster_heatmap() function to redraw the heatmap. The
apply_cluster_heatmap() function is a wrapper that applies
pre-computed cluster assignments to one or multiple count matrices to
generate publication-ready heatmaps. It handles data loading,
validation, and optional comparative analysis across matrices.
What this function does:
Applies existing cluster assignments to count matrices without re-clustering
Validates input files and data integrity
Handles single or multiple count matrices with flexible processing modes
Supports comparative analysis by identifying shared features across matrices
Orders features and samples according to cluster assignments
Generates publication-ready heatmaps with customizable aesthetics
Provides consistent color scaling across multiple matrices
Creates organized output directory structure
| Parameter | Type | Required | Description | Example |
|---|---|---|---|---|
count_matrix_file_path |
character vector | Yes | Vector of file paths to count matrix .feather files. Each file must have ‘pos’ column (rows=genomic positions, cols=samples) | count_matrix_file_path = c("matrix1.feather", "matrix2.feather") |
row_cluster_file_path |
character | Yes | Path to row cluster assignment .tsv file with ‘feature’ and ‘label’ columns containing position-to-cluster mappings | row_cluster_file_path = "row_clusters.tsv" |
col_cluster_file_path |
character | Yes | Path to column cluster assignment .tsv file with ‘feature’ and ‘label’ columns containing sample-to-cluster mappings | col_cluster_file_path = "col_clusters.tsv" |
out_dir |
character | No (default: NULL) | Directory to save heatmap output. If NULL, uses input file’s directory | out_dir = "./heatmaps" |
seed |
integer | No (default: 123) | Random seed for reproducible results (affects any stochastic elements) | seed = 42 |
show_dend_boolean |
logical | No (default: FALSE) | Whether to show column names in heatmap | show_dend_boolean = TRUE |
count_matrix_overlap |
logical | No (default: FALSE) | Whether to analyze only shared positions and samples across matrices for comparative analysis | count_matrix_overlap = TRUE |
lower_range |
numeric | No (default: NULL) | Lower bound for heatmap color scale. If NULL, auto-determined from data | lower_range = 0 |
upper_range |
numeric | No (default: NULL) | Upper bound for heatmap color scale. If NULL, auto-determined from data | upper_range = 10 |
row_title_fontsize |
numeric | No (default: NULL) | Font size for row cluster titles (A, B, C, etc.). If NULL, uses default | row_title_fontsize = 40 |
col_title_fontsize |
numeric | No (default: NULL) | Font size for column cluster titles (1, 2, 3, etc.). If NULL, uses default | col_title_fontsize = 22 |
legend_title_fontsize |
numeric | No (default: NULL) | Font size for legend title. If NULL, uses default | legend_title_fontsize = 40 |
legend_label_fontsize |
numeric | No (default: NULL) | Font size for legend tick labels. If NULL, uses default | legend_label_fontsize = 30 |
The function generates the following output files in the specified
out_dir:
figures/<basename>_figS3A.pdf
row_cluster_file_pathcol_cluster_file_path<input_matrix_basename>_figS3A.pdflibrary(epigenomeR)
result <- add_regions_back_to_cluster(nozero_file_path, transformed_file_path, informative_path, cluster_path, out_path, save_plot = TRUE, plot_path = plot_path, cutoff_non_zero = 10, quantile_threshold = 0.75)
head(result)
dim(result)
nrow(result)
# save the new row cluster
filtered_result <- result[!result$label %in% c("Background", "CRF_specific"), ]
row_output_path <- "output/AAA_row_cluster_for_heatmap.tsv"
write.table(filtered_result, row_output_path, sep = "\t", row.names = FALSE, quote = FALSE)
# rearrange the order
informative_regions <- read_feather(informative_path)
clusters <- rowcol_km_like_ComplexHeatmap(mat = informative_regions, row_k = 15, col_k = 16)
cluster_group_col <- data.frame(clusters$col_num)
colnames(cluster_group_col) <- "label"
cluster_group_col <- cluster_group_col %>% tibble::rownames_to_column("feature")
col_output_path <- "output/AAA_col_cluster_for_heatmap.tsv"
write.table(cluster_group_col, col_output_path, sep = "\t", row.names = FALSE, quote = FALSE)
apply_cluster_heatmap(count_matrix_file_path = transformed_file_path, row_cluster_file_path = path1, col_cluster_file_path = path2, out_dir = "output/AAA_add_regions_heatmap", lower_range = 0, upper_range = 4.5)
The annotation_ccre_hmm() function analyzes the genomic
annotation composition of clustered regions by overlapping with CCRE
(Candidate Cis-Regulatory Elements) and ChromHMM chromatin state
annotations, then generates comparative bar plots showing regulatory
element distribution across clusters.
What this function does:
Parses genomic coordinates from cluster assignment files
Converts position strings to GenomicRanges objects for overlap analysis
Overlaps clustered regions with CCRE annotations (cell-type agnostic)
Overlaps clustered regions with ChromHMM chromatin state annotations
Calculates percentage composition of regulatory elements per cluster
Generates publication-ready stacked bar plots for visual comparison
Excludes background and epitope-specific regions from analysis
Provides quantitative characterization of cluster regulatory landscapes
| Parameter | Type | Required | Description | Example |
|---|---|---|---|---|
row_cluster_file_path |
character | Yes | Path to row cluster assignment .tsv file with ‘feature’ column (genomic positions as “chr_start_end”) and ‘label’ column (cluster assignments) | row_cluster_file_path = "row_clusters.tsv" |
out_dir |
character | Yes | Directory to save annotation plot outputs (PDF format) | out_dir = "./annotation_plots" |
The function generates the following output files in the specified
out_dir:
p_ccre_agnostic.pdf

p_chromhmm_short.pdf

library(epigenomeR)
row_cluster_file_path <- "./bicluster_row_cluster.tsv"
out_dir <- "output/figures"
annotation_ccre_hmm(row_cluster_file_path = row_cluster_file_path, out_dir = out_dir)
The biclustering_annotation_TFBS() function provides a
complete, automated workflow for analyzing transcription factor binding
site (TFBS) enrichment across multiple clusters of genomic regions. It
handles the entire pipeline from control region generation to enrichment
testing and visualization.
What this function does:
Reads a file containing clustered genomic regions (e.g., from biclustering analysis)
Generates matched control regions for each cluster using gene-based matching
Performs TFBS enrichment analysis comparing each cluster against controls
Creates heatmap visualizations showing enrichment patterns across clusters
Saves all intermediate and final results to organized output files
| Parameter | Type | Required | Description | Example |
|---|---|---|---|---|
row_cluster_file_path |
character | Yes | Path to tab-delimited file with clustered regions. Must have
columns: feature (format: chr_start_end) and
label (cluster ID) |
"bicluster_results.tsv" |
out_dir |
character | No (default: “./”) | Output directory where all results will be saved | "./TFBS_results/" |
ref_genome |
character | No (default: “hg38”) | Reference genome version. Options: “hg38”, “hg19”, “mm10”, “mm39” | ref_genome = "mm10" |
control_rep |
integer | No (default: 1) | Multiplier for control region generation. Determines the ratio of
control to target regions. E.g., control_rep = 2 generates
2x control regions; control_rep = 5 generates 5x control
regions |
control_rep = 3 |
regions |
integer | No (default: 800) | Size in base pairs to resize all regions (centered on original midpoint) | regions = 500 |
plot |
logical | No (default: TRUE) | Whether to generate heatmap visualizations. If FALSE, only performs enrichment analysis | plot = FALSE |
Input File Format:
The cluster file must be tab-delimited with at least these columns:
feature label
chr1_1234567_1235567 cluster_1
chr2_9876543_9877543 cluster_1
chr3_5555555_5556555 cluster_2
...
feature: Genomic coordinates in format “chr_start_end”
(underscore-separated)label: Cluster assignment (e.g., “cluster_1”,
“group_A”, “bicluster_2”)All output files are saved to the specified out_dir:
| File | Description |
|---|---|
all_controls.bed |
BED file containing all matched control regions (combined and de-duplicated) |
TFBS_enrichment_cluster_<label>.tsv |
One TSV per cluster with enrichment statistics (odds ratio, p-value, FDR, hit counts) |
TFBS_heatmap_all.pdf |
Heatmap visualization of enriched motifs across all clusters (if
plot = TRUE) |
odds_ratio_log2.csv |
Matrix of log2 odds ratios for all TFs × clusters (if
plot = TRUE) |
FDR.csv |
Matrix of FDR values for all TFs × clusters (if
plot = TRUE) |
library(epigenomeR)
# Basic usage - complete pipeline with visualization
biclustering_annotation_TFBS(
row_cluster_file_path = "NMF_clusters.tsv",
out_dir = "./TFBS_analysis/",
ref_genome = "hg38"
)
# Custom region size for enhancer analysis
biclustering_annotation_TFBS(
row_cluster_file_path = "enhancer_clusters.tsv",
out_dir = "./enhancer_TFBS/",
ref_genome = "hg38",
regions = 1000
)
# Mouse genome analysis
biclustering_annotation_TFBS(
row_cluster_file_path = "mouse_peaks_clustered.tsv",
out_dir = "./mouse_TFBS/",
ref_genome = "mm10",
regions = 500
)
# Generate 3x more control regions for more robust statistics
biclustering_annotation_TFBS(
row_cluster_file_path = "ATAC_peaks_clustered.tsv",
out_dir = "./ATAC_TFBS/",
ref_genome = "hg38",
control_rep = 3,
regions = 800
)
# Enrichment analysis only (no heatmap)
biclustering_annotation_TFBS(
row_cluster_file_path = "clusters.tsv",
out_dir = "./TFBS_tables/",
ref_genome = "hg38",
plot = FALSE
)