1. Fragment Length Analysis Pipeline

The frag_decomposition() function performs a complete fragment length analysis pipeline including quality control, valley detection, and visualization for nucleosome positioning analysis from BAM files.

What this function does:

  • Performs quality control filtering on BAM files based on coverage percentile

  • Computes fragment lengths from filtered paired-end BAM files

  • Detects two local minimum valleys for nucleosome fragment classification

  • Generates fragment length histograms with valley cutoff lines

  • Creates summary statistics report (valley positions, min/max fragment lengths)

  • Produces fragment decomposition data with percentages

  • Generates bar plots showing fragment distribution across samples

Parameters

Parameter Type Required Description Example
file_path character vector Yes Paths to BAM files to analyze file_path = c("sample1.bam", "sample2.bam")
save_dir character Yes Directory path to save all output files and plots save_dir = "./fragment_analysis"
frag_decomp_file character No Path to generated fragment decomposition file (returned by function) Auto-generated
filtered_percentile numeric No Percentile threshold for quality control filtering (default: 0.25, removes bottom 25%) filtered_percentile = 0.25
dens_reso numeric No Resolution for kernel density estimation in valley detection (default: 2^15 = 32768) dens_reso = 2^15
target_pair_mapping_df data frame No Optional data frame for mapping sample names to display names (default: NULL) target_pair_mapping_df = df
density_kernel character No Kernel type for density estimation (default: โ€œgaussianโ€) density_kernel = "gaussian"
valley1_range numeric vector No Search range for first valley in bp (default: c(73, 221)) valley1_range = c(73, 221)
valley2_range numeric vector No Search range for second valley in bp (default: c(222, 368)) valley2_range = c(222, 368)

Output Files

The function generates the following output files in the specified save_dir:

  1. Fragment length data - premerge_all-qc_frag_lens.RData
    • Cached fragment length data from all filtered samples
    • Numeric vector of fragment lengths
  2. Initial histogram - premerge_frag_hist.pdf
    • Fragment length distribution histogram without valley cutoffs
    • Shows overall fragment length distribution
  3. Summary report - summary_report.csv
    • Contains key metrics:
      • local_min1: First valley position (bp)
      • local_min2: Second valley position (bp)
      • min_frag_len: Minimum fragment length (bp)
      • max_frag_len: Maximum fragment length (bp)
  4. Histogram with cutoffs - premerge_frag_hist_with_cutoff.pdf
    • Fragment length distribution with vertical lines at valley positions
    • Used to visualize nucleosome fragment classification
  5. Fragment decomposition data - *_frag_decomp_with_perc.tsv
    • Per-sample fragment counts and percentages
    • Classified by valley thresholds: subnucleo, monomer, dimer+
  6. Bar plots - save_dir/barplots/*.pdf
    • Visual representation of fragment distribution across samples
    • Separate plots for different fragment categories

Example Usage

library(epigenomeR)
frag_decomposition(file_path = bam_files_vector, save_dir = "output/frag_decomposition")

2. Split bam files in bash

## Use in bash
## change with your length of scen_list
#SBATCH --array=0-1

module load samtools

# setup (change)
thres_name="valley-V-qc"
scen_list=( "T" "V" ) # input dir name
scen_simple_name_list=( "T" "V" ) # output name
target_list=( "IgG_control" "H3K36me3" "H3K4me1"... )
load_root_dir="./data_align" # directory path for scen_list
save_root_raw_dir="output/frag_decomposition/split_bam"

# main function
run_frag_decomposition() {
    scen=${scen_list[${SLURM_ARRAY_TASK_ID}]}
    scen_simple_name=${scen_simple_name_list[${SLURM_ARRAY_TASK_ID}]}

    save_root_dir=${save_root_raw_dir}/${thres_name}
    mkdir -p "${save_root_dir}"

    load_dir=${load_root_dir}/${scen}
    mixed_dir=${save_root_dir}/${scen_simple_name}_mixed
    subneucleo_dir=${save_root_dir}/${scen_simple_name}_sub
    mononeucleo_dir=${save_root_dir}/${scen_simple_name}_mono
    dineucleo_dir=${save_root_dir}/${scen_simple_name}_di

    mkdir -p "${mixed_dir}" "${subneucleo_dir}" "${mononeucleo_dir}" "${dineucleo_dir}"
    cp -r ${load_dir}/* "${mixed_dir}/"

    for target_1 in "${target_list[@]}"; do
        for target_2 in "${target_list[@]}"; do
            if [ "${target_1}" \< "${target_2}" ] || [ "${target_1}" = "${target_2}" ]; then
                mixed_filename="${target_1}-${target_2}.bam"
                mixed_dir_filename=${load_dir}/${mixed_filename}

                if [[ ! -f "${mixed_dir_filename}" ]]; then
                    echo "Warning: ${mixed_dir_filename} not found, skip."
                    continue
                fi
                samtools view -h ${mixed_dir_filename} | awk 'substr($0,1,1)=="@" || ($9>=24 && $9<=126) || ($9<=-24 && $9>=-126)' | samtools view -b > ${subneucleo_dir}/${mixed_filename}
                samtools view -h ${mixed_dir_filename} | awk 'substr($0,1,1)=="@" || ($9>=127 && $9<=297) || ($9<=-127 && $9>=-297)' | samtools view -b > ${mononeucleo_dir}/${mixed_filename}
                samtools view -h ${mixed_dir_filename} | awk 'substr($0,1,1)=="@" || ($9>=298 && $9<=800) || ($9<=-298 && $9>=-800)' | samtools view -b > ${dineucleo_dir}/${mixed_filename}
            fi
        done
    done
}

run_frag_decomposition