The `align` command aligns preprocessed paired-end FASTQ
files to a reference genome and produces cleaned, deduplicated, indexed
BAM files for downstream QC and peak-level analysis.
The command automatically processes all paired FASTQ files detected in the input directory. Each FASTQ pair is aligned independently and produces one final BAM file.
For each detected paired-end FASTQ pair, the pipeline performs the following steps:
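Reads are aligned to the reference genome using Bowtie2 in local mode
with the `--very-sensitive-local` preset. Fragment size constraints are
applied through the Bowtie2 paired-end insert size options (`-X` for the
maximum; the corresponding Bowtie2 option for the minimum is `-I`).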
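The alignment step described above can be sketched as a small helper that assembles a Bowtie2 command line. This is an illustration only: the options are standard Bowtie2 options, but the exact command the pipeline runs is not shown here, and the function name, index path, and file names are hypothetical placeholders. The helper prints the command rather than running it.

```shell
# Sketch: build (and print) a Bowtie2 paired-end command line.
# build_bowtie2_cmd, the index path, and the FASTQ names are hypothetical.
build_bowtie2_cmd() {
  local index="$1" r1="$2" r2="$3"
  local min_len="${4:-10}" max_len="${5:-800}" threads="${6:-4}"
  # --local plus the --very-sensitive-local preset; -I/-X bound the fragment size
  echo bowtie2 --local --very-sensitive-local \
    -I "$min_len" -X "$max_len" \
    -p "$threads" -x "$index" -1 "$r1" -2 "$r2"
}

build_bowtie2_cmd /path/to/index/hg38 sample_R1.fastq.gz sample_R2.fastq.gz
```

The defaults of 10 and 800 mirror the documented defaults of `--min-len` and `--max-len`; the thread count of 4 is an arbitrary placeholder.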
The resulting alignments are converted to BAM format, filtered by mapping quality, and cleaned by removing mitochondrial and non-standard contigs. The filtered alignments are then sorted by genomic coordinates.
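As a rough illustration of the filtering step, the sketch below keeps SAM header lines and drops records that fall below a mapping-quality cutoff or sit on mitochondrial or non-standard contigs. Several details are assumptions rather than facts about the pipeline: the cutoff of 30 is a placeholder, contig names are assumed to be UCSC-style (`chr1`, `chrX`, `chrM`), and the function name is hypothetical.

```shell
# Sketch: filter SAM text on stdin by MAPQ and contig name.
# keep_standard is a hypothetical helper; 30 is a placeholder cutoff.
keep_standard() {
  local min_mapq="${1:-30}"
  awk -v q="$min_mapq" 'BEGIN { FS = OFS = "\t" }
    /^@/ { print; next }                                # pass header lines through
    $3 ~ /^chr([1-9]|1[0-9]|2[0-2]|X|Y)$/ && $5 >= q    # standard contig and MAPQ filter
  '
}

# Possible use in a pipeline (requires samtools, not run here):
#   samtools view -h in.bam | keep_standard 30 | samtools sort -o filtered.bam -
```

The contig pattern covers `chr1` to `chr22`, `chrX`, and `chrY`, so it also matches the mouse autosomes `chr1` to `chr19`. Header `@SQ` lines for removed contigs are left in place in this sketch.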
Duplicate removal is performed after alignment. Every deduplication strategy uses genomic alignment coordinates as the primary criterion. Depending on the adapter structure used upstream, additional tag-aware grouping may be applied.
If reads contain cell barcode (CB) information encoded in the read name, duplicate identification becomes CB-aware, meaning that reads with identical coordinates but originating from different cell barcodes are not treated as duplicates.
If reads contain UMI information encoded in the read name, duplicate identification becomes UMI-aware, meaning that reads with identical coordinates but different UMI values are preserved as independent molecules.
When both CB and UMI information are present, duplicates are identified using the combined key of genomic coordinates, CB, and UMI, ensuring that deduplication respects both cellular origin and molecular identity.
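The key logic behind the three deduplication modes can be illustrated with a short sketch. It is not the pipeline's implementation: it works on a simplified tab-separated input (read name, chromosome, start, end) instead of BAM records, and it assumes a hypothetical read-name layout of the form `id_CB:<barcode>_UMI:<umi>`. The function name is also hypothetical.

```shell
# Sketch: keep the first record per deduplication key.
# Usage: dedup_by_key USE_CB USE_UMI   (1 enables a tag, anything else disables it)
# Input: name<TAB>chrom<TAB>start<TAB>end, with names like id_CB:AAA_UMI:T1 (assumed layout).
dedup_by_key() {
  awk -v use_cb="$1" -v use_umi="$2" '
    # Return the value of "label:value" inside an underscore-separated read name.
    function tag(name, label,   i, n, part) {
      n = split(name, part, "_")
      for (i = 1; i <= n; i++)
        if (index(part[i], label ":") == 1) return substr(part[i], length(label) + 2)
      return ""
    }
    BEGIN { FS = "\t" }
    {
      key = $2 ":" $3 ":" $4                      # coordinates are always in the key
      if (use_cb == 1)  key = key "|" tag($1, "CB")
      if (use_umi == 1) key = key "|" tag($1, "UMI")
      if (!(key in seen)) { seen[key] = 1; print }
    }'
}

printf '%s\t%s\t%s\t%s\n' \
  r1_CB:AAA_UMI:T1 chr1 100 250 \
  r2_CB:AAA_UMI:T1 chr1 100 250 \
  r3_CB:BBB_UMI:T1 chr1 100 250 \
  r4_CB:AAA_UMI:T2 chr1 100 250 |
  dedup_by_key 1 1    # keeps r1, r3 and r4; r2 duplicates r1
```

With both tags disabled, the same input collapses to a single record, because all four reads share the same coordinates.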
After duplicate removal, the coordinate-sorted BAM file is
written to disk and indexed using `samtools index`. If
indexing fails due to sorting inconsistencies, the file is re-sorted and
indexed again to guarantee a valid output.
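The index-then-retry behaviour can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline's code: the function name and the temporary `.resorted.bam` file name are hypothetical, and only standard `samtools index` and `samtools sort` options are used.

```shell
# Sketch: index a BAM file, re-sorting once if the first attempt fails.
# index_with_fallback and the .resorted.bam suffix are hypothetical.
index_with_fallback() {
  local bam="$1" threads="${2:-1}"
  local tmp="${bam%.bam}.resorted.bam"
  if ! samtools index -@ "$threads" "$bam"; then
    # Indexing failed, possibly because the file is not coordinate-sorted:
    # sort into a temporary file, replace the original, then index again.
    samtools sort -@ "$threads" -o "$tmp" "$bam" &&
      mv "$tmp" "$bam" &&
      samtools index -@ "$threads" "$bam"
  fi
}
```

A call such as `index_with_fallback ./bam/sample.bam 8` would return a non-zero status only if the re-sort or the second indexing attempt also fails.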
| Argument | Type | Default | Description | Example |
|---|---|---|---|---|
| `-i`, `--input` | character | none | Directory containing paired-end FASTQ files (`.fastq.gz`) generated from the adapter identification step. | `-i ./adapter` |
| `-o`, `--output` | character | none | Output directory where final BAM files will be written (created if missing). | `-o ./bam` |
| `-g`, `--genome` | character | none | Reference genome name used to resolve the Bowtie2 index (`hg38` or `mm10`). | `-g hg38` |
| `--cb` | flag | off | Indicates that reads contain cell barcode (CB) information encoded in the read name. If enabled, deduplication becomes cell-barcode-aware, meaning reads with identical genomic coordinates but different CB values are preserved. | `--cb` |
| `--umi` | flag | off | Indicates that reads contain UMI information encoded in the read name. If enabled, deduplication becomes UMI-aware, meaning reads with identical genomic coordinates but different UMIs are preserved as independent molecules. | `--umi` |
| `--min-len` | integer | 10 | Minimum fragment length accepted by Bowtie2 paired-end alignment. Pairs with inferred insert size below this threshold are discarded. | `--min-len 10` |
| `--max-len` | integer | 800 | Maximum fragment length accepted by Bowtie2 paired-end alignment (`-X`). Pairs with inferred insert size above this threshold are discarded. | `--max-len 1200` |
| `-j`, `--threads` | integer | auto-detect | Number of CPU threads used by Bowtie2 and samtools operations. | `-j 16` |
| `--java-mem` | character | auto-detect | Java heap size passed to Picard (e.g., `24g`, `8000m`). Automatically detected if not provided. | `--java-mem 24g` |
| `--picard-jar` | character | auto-detect | Path to `picard.jar`. If missing, the script attempts to locate it automatically in the conda environment. | `--picard-jar /path/to/picard.jar` |
The command generates the following output files in the specified
`OUT_DIR` (one set per FASTQ prefix `<prefix>`):

- `OUT_DIR/<prefix>.bam`: final filtered, deduplicated, coordinate-sorted BAM file.
- `OUT_DIR/<prefix>.bam.bai`: BAM index generated by `samtools index`.

Intermediate files generated during alignment and deduplication are automatically removed to minimize disk usage.

```shell
# Regular Hiplex CUT&Tag
multiEpiPrep align \
  -i ./adapter \
  -o ./bam \
  -g hg38

# UMI-containing Hiplex CUT&Tag
multiEpiPrep align \
  -i ./adapter \
  -o ./bam \
  -g hg38 \
  --umi

# CB + UMI aware deduplication
multiEpiPrep align \
  -i ./adapter \
  -o ./bam \
  -g hg38 \
  --cb \
  --umi
```