1. Fragment Extraction (.bam to .bed)

The frag command converts paired-end BAM files into fragment-level BED files, providing a standardized interval representation of sequenced DNA fragments for downstream coverage construction and signal-based analyses.

This step bridges alignment-level data (BAM) and interval-level representations (BED). It is implemented entirely in bash and focuses on correctness, determinism, and compatibility with downstream tools such as coverage generation and peak calling.

What this script does:

  • Iterates over all BAM files in a user-specified directory.
  • Verifies that each BAM file has an accompanying index to prevent incomplete or corrupted processing.
  • Performs name-sorting on each BAM file to ensure proper pairing of reads.
  • Converts name-sorted BAM files into BEDPE format using bedtools bamtobed -bedpe.
  • Extracts fragment-level genomic coordinates from BEDPE records.
  • Sorts resulting BED files by genomic position to ensure downstream compatibility.
  • Writes one sorted BED file per input BAM file.

The resulting BED files represent inferred DNA fragments derived from paired-end alignments and serve as the canonical input for subsequent coverage and signal track generation steps.

Parameters

Argument Type Default Description Example
-i, –input character Directory containing input BAM files. Each BAM file is processed independently and is assumed to contain paired-end alignments. -i ./bam
-o, –output character Output directory used to store fragment-level BED files. The directory is recreated at runtime to ensure a clean and deterministic output state. -o ./bed

Output Files

The command generates the following output files in the specified out_dir:

  1. Fragment BED files<pair>.bed
    • One file is produced for each input BAM file.
    • <pair> corresponds to the BAM filename without extension.
    • Each BED file contains three columns: chrom, chromStart, and chromEnd, representing the inferred genomic span of each paired-end fragment.
    • Records are sorted by chromosome and start coordinate to ensure compatibility with downstream tools such as bedtools genomecov.

Example Usage

# General Usage (Extract result from qc)

# Test Data
for d in ./bam/*/; do
  sample=$(basename "$d")
  multiEpiPrep frag \
    -i "$d" \
    -o "./bed/${sample}" \
done

2. Coverage Track Generation (.bed to .bedgraph)

The cov command converts BED interval files into genome-wide coverage tracks in bedGraph format, providing a lightweight and reproducible representation of read or fragment density for downstream peak calling and signal-based analyses.

This step is implemented entirely in bash and serves as a standardized preprocessing layer between interval-level BED data and signal-level analyses. Each input BED file is processed independently, producing one bedGraph file with continuous coverage blocks.

What this script does:

  • Accepts a directory of BED files, each representing genomic intervals for one CRF–CRF pair.
  • Resolves the appropriate chromosome size file based on the selected reference genome (hg38 or mm10).
  • Uses bedtools genomecov -bg to compute genome-wide coverage tracks from BED intervals.
  • Writes one .bedGraph file per input BED file into the specified output directory.
  • Performs strict input validation and dependency checks to prevent silent failures.

The generated bedGraph files are directly compatible with downstream signal-based steps, including peak calling, background modeling, and visualization.

Parameters

Argument Type Default Description Example
-i, –input character Directory containing input BED files. Each BED file is treated as an independent sample (typically one CRF–CRF pair). -i ./bed
-o, –output character Output directory used to store generated bedGraph files. The directory is created recursively if it does not already exist. -o ./bedgraph
-g, –genome character Reference genome used to determine chromosome sizes. Supported values are hg38 and mm10. -g hg38

Output Files

The command generates the following output files in the specified out_dir:

  1. Coverage bedGraph files<pair>.bedGraph
    • One file is produced for each input BED file.
    • <pair> corresponds to the BED filename without extension.
    • The output follows standard bedGraph conventions with four columns: chrom, chromStart, chromEnd, and value.
    • Only non-zero coverage intervals are reported, ensuring compact file size and efficient downstream processing.

Example Usage

# Test Data
for d in ./bed/*/; do
  sample=$(basename "$d")
  multiEpiPrep cov \
    -i "$d" \
    -o "./bedgraph/${sample}" \
    -g hg38
done

3. Signal Track Generation (.bed to .bigwig)

The track command converts BED interval files into bigWig signal tracks, producing compact, indexed coverage representations suitable for genome browser visualization and downstream signal-based analyses.

This step is implemented entirely in bash and standardizes the transition from fragment-level intervals (BED) to continuous signal tracks (bigWig). Each input BED file is processed independently, and the command supports optional RPM normalization to make tracks comparable across CRF–CRF pairs with different sequencing depths.

What this script does:

  • Iterates over all BED files in a user-specified directory (one BED per CRF–CRF pair).
  • Resolves the genome-specific chromosome size file based on the selected reference genome (hg38 or mm10).
  • Computes total fragments/reads per BED file by counting non-empty lines (used as the library size estimate).
  • Optionally applies RPM normalization by scaling coverage with 1e6 / total_reads.
  • Runs bedtools genomecov -bg to generate a bedGraph coverage track.
  • Sorts bedGraph by genomic coordinate (chrom, start) using a stable locale setting (LC_ALL=C).
  • Converts sorted bedGraph to bigWig via bedGraphToBigWig.
  • Cleans up intermediate bedGraph files to keep outputs minimal and deterministic.

The resulting bigWig tracks can be loaded directly into genome browsers (e.g., IGV/UCSC) and serve as the canonical signal representation for visualization and interpretation.

Parameters

Argument Type Default Description Example
-i, –input character Directory containing input BED files. Each BED file is treated as an independent sample (typically one CRF–CRF pair). -i ./bed
-o, –output character Output directory used to store generated bigWig tracks (.bw). Created recursively if it does not exist. -o ./tracks
-g, –genome character Reference genome used to determine chromosome sizes for coverage calculation and bigWig indexing. Supported values are hg38 and mm10. -g hg38
-n, –normalized
–no-normalized
logical true Whether to apply RPM normalization. If enabled, coverage is scaled by 1e6 / total_reads prior to exporting. Use –no-normalized to export raw (unscaled) coverage tracks. -n
–no-normalized

Output Files

The command generates the following output files in the specified out_dir:

  1. bigWig tracks<pair>.bw
    • One file is produced for each input BED file.
    • <pair> corresponds to the BED filename without extension.
    • Tracks are indexed using the selected genome’s chrom.sizes, enabling fast random access for genome browsers.
    • If normalization is enabled, signal values represent RPM-scaled coverage and are comparable across pairs with different depths.

Example Usage

# Test Data
for d in ./bed/*/; do
  sample=$(basename "$d")
  multiEpiPrep cov \
    -i "$d" \
    -o "./bigwig/${sample}" \
    -g hg38
done