1. Identify Adapter

The adapt command scans merged, demultiplexed paired-end FASTQ files for a user-defined adapter structure and appends parsed tag information to the read name for downstream processing.

The adapter structure is explicitly defined as:

[cell barcode] - [spacer] - [umi] - [linker]

All four components are interpreted in this fixed order during matching. Depending on the experimental design, each component can be defined either by fixed length or by one or more candidate sequences.

For each detected FASTQ pair, the pipeline performs the following steps:

  • Reads are processed in paired-end mode from merged demultiplexed FASTQ files, with one FASTQ pair per prefix. Files whose prefixes are listed in --exclude are skipped before adapter identification begins.

  • The command scans the 5’ region of each read for the adapter structure, always in the order [cell barcode] - [spacer] - [umi] - [linker]. Components are matched sequentially, and matching tolerance is controlled by --error-rate (mismatches) and --laxity (positional slack).

  • Each adapter component can be specified in one of four supported formats: fixed length, fixed sequence, comma-separated candidate sequence list, or @file containing one candidate sequence per line. This allows the same command to support simple fixed-layout designs as well as more complex candidate-based barcode structures.

  • For sequence-based components, mismatches are allowed according to --error-rate, where the maximum mismatch count is computed as floor(length(tag) * error_rate). In addition, --laxity controls how many bases may be skipped when searching for the next expected tag in the structure.

  • Once adapter components are identified in both R1 and R2, the parsed tag values are appended to the read name. CB and UMI information are additionally written as explicit key-value fields in the read header, making them easy to retrieve in downstream steps such as grouping, tracking, or UMI-aware processing.

  • The output consists of trimmed FASTQ files with updated read names, written as paired gzipped FASTQ files to the specified output directory.
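The mismatch rule in the steps above can be worked through by hand. The sketch below is illustrative only and is not the tool's implementation: max_mismatches and hamming are hypothetical helper names, and counting mismatches as substitutions between equal-length sequences is an assumption. It shows that the 34-base linker tolerates 3 mismatches at an error rate of 0.1, while a 4-base barcode tolerates none.

```shell
# Illustrative helpers only; the names are hypothetical, not part of multiEpiPrep.

# floor(length(tag) * error_rate). awk's int() truncates toward zero,
# which equals floor for non-negative values.
max_mismatches() {
  awk -v len="${#1}" -v rate="$2" 'BEGIN { print int(len * rate) }'
}

# Number of differing positions between two equal-length sequences
# (assumption: a mismatch is a substitution, not an insertion or deletion).
hamming() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    d = 0
    for (i = 1; i <= length(a); i++)
      if (substr(a, i, 1) != substr(b, i, 1)) d++
    print d
  }'
}

linker=GCGATCGAGGACGGCAGATGTGTATAAGAGACAG
max_mismatches "$linker" 0.1   # prints 3 (floor of 3.4)
max_mismatches GGGG 0.1        # prints 0 (floor of 0.4)
hamming GGGG GGGA              # prints 1, which exceeds a budget of 0
```

Under this reading, short candidate sequences effectively require an exact match unless --error-rate is raised.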

Parameters

| Argument | Type | Default | Description | Example |
|---|---|---|---|---|
| -i, --input | character | | Directory containing merged demultiplexed paired-end FASTQ files in .fastq.gz format. | -i ./demux |
| -o, --output | character | | Output directory for trimmed FASTQ files with annotated read names. | -o ./adapter |
| --cb | integer / character | | Definition of the cell barcode component in the adapter structure. Supported formats: an integer (fixed length), a string (fixed sequence), comma-separated sequences (candidate sequence list), or @file (one candidate sequence per line). | --cb AAAA,CCCC,GGGG,TTTT |
| --sp | integer / character | | Definition of the spacer component in the adapter structure. Supported formats are the same as for --cb. | --sp 8 |
| --umi | integer / character | | Definition of the UMI component in the adapter structure. Supported formats are the same as for --cb. | --umi 8 |
| --linker | integer / character | | Definition of the linker component in the adapter structure. Supported formats are the same as for --cb. | --linker GCGATCGAGGACGGCAGATGTGTATAAGAGACAG |
| -r, --error-rate | numeric | 0.1 | Mismatch rate allowed for sequence-based tag matching. The maximum mismatch count is calculated as floor(length(tag) * error_rate). | -r 0.1 |
| -l, --laxity | integer | 0 | Maximum number of bases that may be skipped when searching for the next tag in the adapter structure. | -l 2 |
| -e, --exclude | character vector | c("unknown", "IgG_control") | FASTQ prefixes to skip before adapter identification. | -e unknown IgG_control |
| -j, --threads | integer | auto-detect | Number of FASTQ prefixes to process in parallel. Falls back to all available CPU cores when not provided. | -j 8 |
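
As a sketch of the @file format listed for --cb, the snippet below writes one candidate sequence per line and assembles a matching command line. The file name cb_candidates.txt is a placeholder, and it is an assumption that the path after @ is resolved relative to the working directory. The command is printed rather than executed so the snippet runs without the tool installed.

```shell
# Placeholder file name; one candidate sequence per line.
cb_file=cb_candidates.txt
printf '%s\n' AAAA CCCC GGGG TTTT > "$cb_file"

# Intended to express the same candidate list as --cb AAAA,CCCC,GGGG,TTTT.
cmd="multiEpiPrep adapt -i ./demux -o ./adapter --cb @$cb_file --umi 8 --linker GCGATCGAGGACGGCAGATGTGTATAAGAGACAG"
echo "$cmd"
```

A file is likely the more practical choice once the candidate list grows beyond a handful of sequences.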

Read Name Annotation

When adapter components are successfully identified, the parsed values from both R1 and R2 are appended to the read name. CB and UMI values are additionally recorded as explicit key-value fields.

Input Example:
@A00123:45:H3F7MDSX2:1:1101:10000:1000 1:N:0:ATCGTAGC

Output Example:
@A00123:45:H3F7MDSX2:1:1101:10000:1000::[GGGG]-[CCCCCC]::[AAAA]-[TTTTTT]::CB=GGGG-AAAA::UMI=CCCCCC-TTTTTT
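
The key-value fields in the output example can be retrieved with standard text tools. The sketch below is one possible approach, not something the tool provides: get_field is a hypothetical helper, and it assumes fields are separated by "::" exactly as in the example.

```shell
# Hypothetical helper: print the value of a KEY=value field in a read name.
# $1 = key (CB or UMI), $2 = read name. Prints nothing if the key is absent.
get_field() {
  printf '%s\n' "$2" | sed -n "s/.*::$1=\([^:[:space:]]*\).*/\1/p"
}

name='@A00123:45:H3F7MDSX2:1:1101:10000:1000::[GGGG]-[CCCCCC]::[AAAA]-[TTTTTT]::CB=GGGG-AAAA::UMI=CCCCCC-TTTTTT'
get_field CB "$name"    # prints GGGG-AAAA
get_field UMI "$name"   # prints CCCCCC-TTTTTT
```

Applied to every fourth line of a decompressed FASTQ file (the header lines), the same pattern could feed a per-barcode tally.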

Output Files

The command writes the following files to the output directory given by -o (OUT_DIR):

  • OUT_DIR/{name1}-{name2}_R1.trimmed.fastq.gz
  • OUT_DIR/{name1}-{name2}_R2.trimmed.fastq.gz
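
Because the outputs are written as pairs, the R1 and R2 files for a prefix should contain the same number of records. The sketch below is a quick sanity check under that assumption; count_reads and check_pair are hypothetical helpers, ./adapter stands in for the directory passed to -o, and the standard four-lines-per-record FASTQ layout is assumed.

```shell
# Hypothetical helper: number of FASTQ records in a gzipped file,
# assuming four lines per record.
count_reads() {
  gzip -dc "$1" | awk 'END { print NR / 4 }'
}

# Compare the record counts of one R1/R2 pair ($1 = R1 path, $2 = R2 path).
check_pair() {
  if [ "$(count_reads "$1")" = "$(count_reads "$2")" ]; then
    echo "OK $1"
  else
    echo "MISMATCH $1"
    return 1
  fi
}

out_dir=./adapter   # placeholder for the directory passed to -o
for r1 in "$out_dir"/*_R1.trimmed.fastq.gz; do
  [ -e "$r1" ] || continue   # skip when the glob matches nothing
  check_pair "$r1" "${r1%_R1.trimmed.fastq.gz}_R2.trimmed.fastq.gz"
done
```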

Example Usage

# 1) Regular Hiplex CUT&Tag
multiEpiPrep adapt \
  -i ./demux \
  -o ./adapter \
  --linker GCGATCGAGGACGGCAGATGTGTATAAGAGACAG,CACCGTCTCCGCCTCAGATGTGTATAAGAGACAG

# 2) UMI-containing Hiplex CUT&Tag
multiEpiPrep adapt \
  -i ./demux \
  -o ./adapter \
  --sp 8 \
  --umi 8 \
  --linker GCGATCGAGGACGGCAGATGTGTATAAGAGACAG,CACCGTCTCCGCCTCAGATGTGTATAAGAGACAG

# 3) Candidate CB sequence list
multiEpiPrep adapt \
  -i ./demux \
  -o ./adapter \
  --cb AAAA,CCCC,GGGG,TTTT \
  --umi 8 \
  --linker GCGATCGAGGACGGCAGATGTGTATAAGAGACAG,CACCGTCTCCGCCTCAGATGTGTATAAGAGACAG

# 4) Strict mode
multiEpiPrep adapt \
  -i ./demux \
  -o ./adapter \
  -r 0 \
  -l 0

# Test data
DEMUX_DIR="./demux"
TRIM_DIR="./trim"

for d in "$DEMUX_DIR"/*/; do
  [ -d "$d" ] || continue
  sample=$(basename "$d")
  echo "======================"
  echo "$sample"
  echo "======================"

  out="${TRIM_DIR}/${sample}"
  multiEpiPrep adapt \
    -i "$d" \
    -o "$out" \
    --linker GCGATCGAGGACGGCAGATGTGTATAAGAGACAG,CACCGTCTCCGCCTCAGATGTGTATAAGAGACAG
done