1. Alignment

The align command aligns preprocessed paired-end FASTQ files to a reference genome and produces cleaned, deduplicated, indexed BAM files for downstream QC and peak-level analysis.

The command automatically processes all paired FASTQ files detected in the input directory. Each FASTQ pair is aligned independently and produces one final BAM file.
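As a rough illustration of how paired files in an input directory can be matched up, the sketch below pairs files by suffix. The `_R1.fastq.gz` / `_R2.fastq.gz` naming and the function name `list_fastq_pairs` are assumptions made for this example; the naming convention that multiEpiPrep actually detects may differ.

```shell
#!/usr/bin/env bash
# Sketch only: the _R1/_R2 suffix convention is an assumption,
# not necessarily the naming that multiEpiPrep itself expects.

# Print one tab-separated line per complete pair: <prefix> <R1 path> <R2 path>
list_fastq_pairs() {
  local in_dir=$1 r1 r2 prefix
  for r1 in "$in_dir"/*_R1.fastq.gz; do
    [ -e "$r1" ] || continue              # the glob matched nothing
    r2=${r1%_R1.fastq.gz}_R2.fastq.gz     # expected mate file
    prefix=$(basename "${r1%_R1.fastq.gz}")
    if [ -e "$r2" ]; then
      printf '%s\t%s\t%s\n' "$prefix" "$r1" "$r2"
    else
      printf 'warning: no R2 mate for %s\n' "$r1" >&2
    fi
  done
}
```

Calling `list_fastq_pairs ./adapter` would list the prefixes for which one BAM file is expected in the output directory.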

For each detected paired-end FASTQ pair, the pipeline performs the following steps:

Reads are aligned to the reference genome with Bowtie2 in local alignment mode using the very-sensitive-local preset. Fragment size limits are enforced through Bowtie2's paired-end insert size options (`-I` for the minimum, `-X` for the maximum).
The resulting alignments are converted to BAM format, filtered by mapping quality, and cleaned by removing mitochondrial and non-standard contigs. The filtered alignments are then sorted by genomic coordinates.
Duplicate removal is performed after alignment. Every deduplication strategy uses genomic alignment coordinates as the primary criterion. Depending on the adapter structure used upstream, additional tag-aware grouping may be applied:
- If reads contain cell barcode (CB) information encoded in the read name, duplicate identification becomes CB-aware, meaning that reads with identical coordinates but originating from different cell barcodes are not treated as duplicates.
- If reads contain UMI information encoded in the read name, duplicate identification becomes UMI-aware, meaning that reads with identical coordinates but different UMI values are preserved as independent molecules.
- When both CB and UMI information are present, duplicates are identified using the combined key of genomic coordinates, CB, and UMI, ensuring that deduplication respects both cellular origin and molecular identity.
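The effect of the three grouping modes can be illustrated with a small awk filter over SAM records. This is a conceptual sketch, not the pipeline's own deduplication code: it assumes read names end in `_<CB>_<UMI>` (a hypothetical layout), and it approximates "genomic coordinates" by the raw FLAG, contig, position, mate position, and template length fields rather than by clipping-adjusted fragment ends.

```shell
# Conceptual sketch of tag-aware duplicate keys; not the pipeline's own code.
# Assumption: read names end in _<CB>_<UMI> (hypothetical layout).
# usage: dedup_sketch <none|cb|umi|cb_umi> < input.sam
dedup_sketch() {
  awk -v mode="$1" '
    BEGIN { FS = "\t" }
    /^@/ { print; next }                 # pass SAM header lines through
    {
      n = split($1, part, "_")
      cb  = (n >= 3) ? part[n - 1] : ""
      umi = (n >= 3) ? part[n] : ""
      # Coordinate part of the key: flag, contig, position, mate position, template length.
      key = $2 FS $3 FS $4 FS $8 FS $9
      if (mode == "cb" || mode == "cb_umi")  key = key FS cb
      if (mode == "umi" || mode == "cb_umi") key = key FS umi
      if (!(key in seen)) { seen[key] = 1; print }   # keep the first record per key
    }'
}
```

With four records at identical coordinates named `r1_AAAA_TTTT`, `r2_AAAA_TTTT`, `r3_CCCC_TTTT`, and `r4_AAAA_GGGG`, the mode `none` keeps one record, `cb` and `umi` keep two each, and `cb_umi` keeps three.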
After duplicate removal, the coordinate-sorted BAM file is written to disk and indexed using samtools index. If indexing fails because of sorting inconsistencies, the file is re-sorted and indexed again so that the final output is valid.
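The alignment, contig filtering, and indexing steps above can be sketched with a few shell helpers. This is a minimal sketch under stated assumptions rather than the script's actual implementation: the helper names, the thread default of 4, the `chr1..chrN, chrX, chrY` contig pattern (UCSC-style names as in hg38 and mm10), and the `MIN_MAPQ` placeholder are all hypothetical. The contig filter only looks at each read's own contig, not its mate's.

```shell
# Bowtie2 arguments for one FASTQ pair, printed one per line.
# Defaults mirror --min-len 10 and --max-len 800; the thread count is a placeholder.
bowtie2_args() {
  local index=$1 r1=$2 r2=$3 min_len=${4:-10} max_len=${5:-800} threads=${6:-4}
  printf '%s\n' --local --very-sensitive-local \
    -I "$min_len" -X "$max_len" -p "$threads" \
    -x "$index" -1 "$r1" -2 "$r2"
}

# Keep SAM header lines and records on standard nuclear contigs only
# (assumed pattern: chr + number, chrX, chrY); drops chrM and unplaced contigs.
keep_standard_contigs() {
  awk 'BEGIN { FS = "\t" }
       /^@/ { print; next }
       $3 ~ /^chr([0-9]+|X|Y)$/ { print }'
}

# Index a BAM; if that fails, re-sort by coordinate and try once more.
index_with_fallback() {
  local bam=$1
  samtools index "$bam" && return 0
  samtools sort -o "$bam.resorted" "$bam" &&
    mv "$bam.resorted" "$bam" &&
    samtools index "$bam"
}

# Example wiring (not executed here; MIN_MAPQ is a placeholder threshold):
#   bowtie2 $(bowtie2_args "$INDEX" "$R1" "$R2") \
#     | samtools view -h -q "$MIN_MAPQ" - \
#     | keep_standard_contigs \
#     | samtools sort -o "$PREFIX.filtered.bam" -
```

Deduplication would sit between the sort and the final call to `index_with_fallback`; it is left out here because the exact Picard invocation is not described in this section.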

Parameters

| Argument | Type | Default | Description | Example |
| --- | --- | --- | --- | --- |
| `-i, --input` | character | none | Directory containing paired-end FASTQ files (`.fastq.gz`) generated from the adapter identification step. | `-i ./adapter` |
| `-o, --output` | character | none | Output directory where final BAM files will be written (created if missing). | `-o ./bam` |
| `-g, --genome` | character | none | Reference genome name used to resolve the Bowtie2 index (`hg38` or `mm10`). | `-g hg38` |
| `--cb` | flag | off | Indicates that reads contain cell barcode (CB) information encoded in the read name. If enabled, deduplication becomes cell-barcode-aware, meaning reads with identical genomic coordinates but different CB values are preserved. | `--cb` |
| `--umi` | flag | off | Indicates that reads contain UMI information encoded in the read name. If enabled, deduplication becomes UMI-aware, meaning reads with identical genomic coordinates but different UMIs are preserved as independent molecules. | `--umi` |
| `--min-len` | integer | `10` | Minimum fragment length accepted by Bowtie2 paired-end alignment. Pairs with inferred insert size below this threshold are discarded. | `--min-len 10` |
| `--max-len` | integer | `800` | Maximum fragment length accepted by Bowtie2 paired-end alignment (`-X`). Pairs with inferred insert size above this threshold are discarded. | `--max-len 1200` |
| `-j, --threads` | integer | auto-detect | Number of CPU threads used by Bowtie2 and samtools operations. | `-j 16` |
| `--java-mem` | character | auto-detect | Java heap size passed to Picard (e.g., `24g`, `8000m`). Automatically detected if not provided. | `--java-mem 24g` |
| `--picard-jar` | character | auto-detect | Path to `picard.jar`. If missing, the script attempts to locate it automatically in the conda environment. | `--picard-jar /path/to/picard.jar` |

Output Files

The command writes the following files to the output directory given by `-o` (OUT_DIR below), one set per FASTQ prefix (<prefix>):

Final BAM - OUT_DIR/<prefix>.bam
- Coordinate-sorted, filtered, and deduplicated BAM file.
BAM index - OUT_DIR/<prefix>.bam.bai
- BAM index created by samtools index.

Intermediate files generated during alignment and deduplication are automatically removed to minimize disk usage.
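After a run, it can be useful to confirm that every BAM file in the output directory has its index next to it. The helper below is a small sketch for that check; the function name `check_bam_outputs` is hypothetical and not part of multiEpiPrep.

```shell
# Report every BAM in OUT_DIR that lacks a .bai index.
# Returns 0 when all BAM files are indexed, 1 otherwise.
check_bam_outputs() {
  local out_dir=$1 bam missing=0
  for bam in "$out_dir"/*.bam; do
    [ -e "$bam" ] || continue             # the glob matched nothing
    if [ ! -e "$bam.bai" ]; then
      printf 'missing index: %s\n' "$bam" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

For example, `check_bam_outputs ./bam && echo "all BAM files indexed"`. This only checks that the index files exist; a tool such as `samtools quickcheck` could additionally be used to verify that the BAM files themselves are intact.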

Example Usage

# Regular Hiplex CUT&Tag
multiEpiPrep align \
  -i ./adapter \
  -o ./bam \
  -g hg38

# UMI-containing Hiplex CUT&Tag
multiEpiPrep align \
  -i ./adapter \
  -o ./bam \
  -g hg38 \
  --umi

# CB + UMI aware deduplication
multiEpiPrep align \
  -i ./adapter \
  -o ./bam \
  -g hg38 \
  --cb \
  --umi