1. Build Count Matrix

The build_count_matrix() function builds a fragment-overlap count matrix from paired-end BAM files over user-specified genomic regions.

Parameters

Required Parameters

  • bam_path: Vector of BAM file paths
  bam_path <- c("../sample1.bam", "../sample2.bam", "../sample3.bam")
  • regions: Genomic regions to evaluate, either:
    • Integer (bin size): The genome will be segmented into non-overlapping bins of this size, and CUT&Tag counts will be summarized per bin.
    • File path: Genomic intervals will be extracted directly from the file, and counts will be summarized exactly on these regions. When a file is provided, the file must contain valid genomic coordinates. Only one region file is accepted in the current implementation.
    regions <- 800
    regions <- "peaks.bed"
  • save_dir: Output directory
  save_dir <- "output/count_matrices"

Optional Parameters

Parameter Default Description
ref_genome “hg38” Reference genome
sample_name NULL Optional prefix added to the output filename. If provided, the final Feather file will include this value at the beginning of the filename.
do_qc FALSE Perform quality control
qc_percent 0.25 QC filtering threshold (filter bottom 25%)
force_chr_coord FALSE By default, if the provided regions file contains a gene identifier column (for example, gene_id in gff/tsv files or the 4th “name” column in BED files), this column will be used as the row name instead of the “CHR_start_end” coordinate string.
Setting force_chr_coord = TRUE overrides this behavior and forces the use of genomic coordinates as row names.

Output Files

The generated count matrix is saved as a Feather file following these conventions:

  • Fixed-size bins (numeric regions):
    Count_Matrix_<BINSIZE>.feather (e.g., Count_Matrix_5000.feather)

  • Custom regions file (character regions):
    Count_Matrix_<prefix>.feather, where <prefix> is the basename of the region file (without extension). (e.g., "peaks.bed"Count_Matrix_peaks.feather)

  • When sample_name is provided:
    <sample_name>_<Count_Matrix_xxx>.feather (e.g., C1_Count_Matrix_peaks.feather)

Read Output Files

library(arrow)
count_matrix <- read_feather("output/Count_Matrix_800.feather")
head(count_matrix)
A tibble: 6 × 5
pos H3K27ac-H3K4me3 H3K4me3-H3K4me3 H3K4me1-H3K4me3 H3K4me3-H3K9me3
<chr> <dbl> <dbl> <dbl> <dbl>
1 chr1_1_800 0 0 0 0
2 chr1_801_1600 0 0 0 0
3 chr1_1601_2400 0 0 0 0
4 chr1_2401_3200 0 0 0 0
5 chr1_3201_4000 0 0 0 0
6 chr1_4001_4800 0 0 0 0

Usage Examples

Example 1: Basic Usage with Fixed Bins

# Generate count matrix with 800b bins
build_count_matrix(
  bam_path = bam_files,
  regions = 800,
  save_dir = "output/5kb_bins"
)

Example 2: Using Custom Peak Regions

# Use peak regions from BED file
build_count_matrix(
  bam_path = bam_files,
  regions = "peaks/merged_peaks.bed",
  save_dir = "output/peak_counts"
)

Example 3: With Quality Control

# Enable QC filtering
build_count_matrix(
  bam_path = bam_files,
  regions = 800,
  save_dir = "output/5kb_qc",
  do_qc = TRUE,
  qc_filtered_percentile = 0.25 
)

2. Apply transformation

The apply_transformations() function performs a sequential series of normalization and transformation steps on a count matrix and outputs the processed result.

Parameters

Required Parameters

  • cm_path: Path to the input count matrix (.feather) The input file must be a valid .feather file containing a positional column (pos) or any first column that uniquely identifies genomic intervals. All remaining columns must be numeric.
cm_path <- "input/sample_counts.feather"
  • save_dir: Output directory
  save_dir <- "output/count_matrices"

Optional Parameters

Parameter Default Description
transformations c(“remove0”, “libnorm”, “log2p1”, “qnorm”) Vector specifying the transformation pipeline. Steps are executed in order, and the behavior depends on the sequence you provide (see the table below for details).
save_each_step FALSE If TRUE, writes a .feather file after each step, allowing inspection of intermediate matrices.

The order of operations in transformations directly determines the output. Below is the full list of supported steps:

Transformation Description
remove0 Remove rows where all samples have zero counts
libnorm Library size normalization (CPM, counts per million)
log2p1 Log2 transformation: log2(x + 1)
qnorm Quantile normalization across samples
minmaxnorm Scale values to [0, 1] range
sqrt Square root transformation

Output Files

After processing all selected transformations, a final Feather file named:

<original_name>_transformed.feather

If set save_each_step = TRUE, the output path will be:

save_dir/
├── [count_matrix_prefix]_transformed.feather
├── [count_matrix_prefix]_remove0.feather
├── [count_matrix_prefix]_remove0_libnorm.feather
├── [count_matrix_prefix]_remove0_libnorm_log2p1.feather
└── [count_matrix_prefix]_remove0_libnorm_log2p1_qnorm.feather

Example Usage

apply_transformations(
  cm_path       = "Count_Matrix_800.feather",
  transformations = c("remove0", "libnorm", "log2p1", "qnorm"),
  save_dir        = "./",
  save_each_step  = TRUE
)