The adapt command scans merged, demultiplexed paired-end
FASTQ files for a user-defined adapter structure and appends parsed tag
information to the read name for downstream processing.
The adapter structure is explicitly defined as:
[cell barcode] - [spacer] - [umi] - [linker]
All four components are interpreted in this fixed order during matching. Depending on the experimental design, each component can be defined either by fixed length or by one or more candidate sequences.
For each detected paired-end FASTQ pair, the pipeline performs the following steps:
Reads are processed in paired-end mode from merged demultiplexed
FASTQ files, with one FASTQ pair per prefix. Files whose prefixes are
listed in --exclude are skipped before adapter
identification begins.
The command scans the 5’ region of each read according to the
predefined adapter structure, where the expected tag order is always
[cell barcode] - [spacer] - [umi] - [linker]. Each
component is matched sequentially, and the parser allows user-controlled
flexibility through mismatch tolerance and positional laxity.
Each adapter component can be specified in one of four supported
formats: fixed length, fixed sequence, comma-separated candidate
sequence list, or @file containing one candidate sequence
per line. This allows the same command to support simple fixed-layout
designs as well as more complex candidate-based barcode
structures.
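The four formats can be told apart from the argument string alone. The following is a minimal sketch of one way to do that, not the tool's actual parser: the helper name `parse_component` and the returned dictionary layout are hypothetical, and the rule "all digits means fixed length" is an assumption.

```python
from pathlib import Path


def parse_component(spec: str) -> dict:
    """Classify an adapter component definition (hypothetical helper).

    Assumed rules, not confirmed by the tool:
      - all digits          -> fixed length
      - leading '@'         -> file with one candidate sequence per line
      - anything else       -> one sequence, or a comma-separated list
    """
    spec = spec.strip()
    if spec.isdigit():
        return {"kind": "length", "length": int(spec)}
    if spec.startswith("@"):
        # Read candidates from the file named after the '@', skipping blank lines.
        lines = Path(spec[1:]).read_text().splitlines()
        seqs = [ln.strip().upper() for ln in lines if ln.strip()]
    else:
        # A single fixed sequence is treated as a one-element candidate list.
        seqs = [s.strip().upper() for s in spec.split(",") if s.strip()]
    return {"kind": "sequences", "candidates": seqs}


print(parse_component("8"))
print(parse_component("AAAA,CCCC,GGGG,TTTT"))
```

Treating a fixed sequence as a one-element candidate list keeps the later matching code uniform across the three sequence-based formats.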
For sequence-based components, mismatches are allowed according
to --error-rate, where the maximum mismatch count is
computed as floor(length(tag) * error_rate). In addition,
--laxity controls how many bases may be skipped when
searching for the next expected tag in the structure.
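To make the two parameters concrete, here is a small sketch of how the mismatch budget and the laxity window could interact. It is an illustration under stated assumptions rather than the command's implementation: `max_mismatches`, `hamming`, and `find_tag` are hypothetical names, and mismatches are assumed to be substitutions only (no insertions or deletions).

```python
import math


def max_mismatches(tag: str, error_rate: float) -> int:
    """Mismatch budget as stated in the text: floor(length(tag) * error_rate)."""
    return math.floor(len(tag) * error_rate)


def hamming(a: str, b: str) -> int:
    """Count substituted positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def find_tag(read, start, candidates, error_rate=0.1, laxity=0):
    """Look for a candidate at start, start+1, ..., start+laxity.

    Returns (position, candidate) for the first hit, or None.
    Assumption: substitutions only; the real tool may behave differently.
    """
    for skip in range(laxity + 1):
        pos = start + skip
        for cand in candidates:
            window = read[pos:pos + len(cand)]
            if len(window) < len(cand):
                continue  # read too short to hold this candidate here
            if hamming(window, cand) <= max_mismatches(cand, error_rate):
                return pos, cand
    return None


linker = "GCGATCGAGGACGGCAGATGTGTATAAGAGACAG"
print(len(linker), max_mismatches(linker, 0.1))  # 34-base linker, budget of 3
print(max_mismatches("AAAA", 0.1))               # 4-base barcode, budget of 0

tag = "ACGTACGTAC"
read = "TT" + "ACGTACGTAA" + "GGGG"  # two extra bases, then the tag with one substitution
print(find_tag(read, 0, [tag], error_rate=0.1, laxity=2))
print(find_tag(read, 0, [tag], error_rate=0.1, laxity=0))
```

One consequence of the floor in the formula: at the default rate of 0.1, any tag shorter than 10 bases gets a budget of zero and must match exactly.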
Once adapter components are identified in both R1 and R2, the parsed tag values are appended to the read name. CB and UMI information are additionally written as explicit key-value fields in the read header, making them easy to retrieve in downstream steps such as grouping, tracking, or UMI-aware processing.
The output consists of trimmed FASTQ files with updated read names, written as paired gzipped FASTQ files to the specified output directory.
| Argument | Type | Default | Description | Example |
|---|---|---|---|---|
| `-i`, `--input` | character | — | Directory containing merged demultiplexed paired-end FASTQ files in `.fastq.gz` format. | `-i ./demux` |
| `-o`, `--output` | character | — | Output directory for trimmed FASTQ files with annotated read names. | `-o ./adapter` |
| `--cb` | integer / character | — | Definition of the cell barcode component in the adapter structure. Supported formats: fixed length, fixed sequence, comma-separated candidate sequence list, or `@file` containing one candidate sequence per line. | `--cb AAAA,CCCC,GGGG,TTTT` |
| `--sp` | integer / character | — | Definition of the spacer component in the adapter structure. Supported formats are the same as for `--cb`. | `--sp 8` |
| `--umi` | integer / character | — | Definition of the UMI component in the adapter structure. Supported formats are the same as for `--cb`. | `--umi 8` |
| `--linker` | integer / character | — | Definition of the linker component in the adapter structure. Supported formats are the same as for `--cb`. | `--linker GCGATCGAGGACGGCAGATGTGTATAAGAGACAG` |
| `-r`, `--error-rate` | numeric | `0.1` | Mismatch rate allowed for sequence-based tag matching. The maximum mismatch count is calculated as `floor(length(tag) * error_rate)`. | `-r 0.1` |
| `-l`, `--laxity` | integer | `0` | Maximum number of bases that may be skipped when searching for the next tag in the adapter structure. | `-l 2` |
| `-e`, `--exclude` | character vector | `c("unknown", "IgG_control")` | FASTQ prefixes to skip before adapter identification. | `-e unknown IgG_control` |
| `-j`, `--threads` | integer | auto-detect | Number of FASTQ prefixes to process in parallel. Falls back to all available CPU cores when not provided. | `-j 8` |
When adapter components are successfully identified, the parsed values from both R1 and R2 are appended to the read name. CB and UMI values are additionally recorded as explicit key-value fields.
Input Example: @A00123:45:H3F7MDSX2:1:1101:10000:1000 1:N:0:ATCGTAGC

Output Example: @A00123:45:H3F7MDSX2:1:1101:10000:1000::[GGGG]-[CCCCCC]::[AAAA]-[TTTTTT]::CB=GGGG-AAAA::UMI=CCCCCC-TTTTTT
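Downstream steps can recover the key-value fields from an annotated read name with a regular expression. This sketch is based only on the output example above: the helper `parse_tags` is hypothetical, and it assumes that the two hyphen-separated values in each field are the R1 and R2 tags in that order, and that tag values contain no `:` or `-` characters.

```python
import re

# Matches the trailing "::CB=<r1>-<r2>::UMI=<r1>-<r2>" part of an annotated read name.
TAG_RE = re.compile(r"::CB=(?P<cb>[^:\s]+)::UMI=(?P<umi>[^:\s]+)")


def parse_tags(read_name: str):
    """Return the CB and UMI fields of an annotated read name, or None if absent."""
    m = TAG_RE.search(read_name)
    if m is None:
        return None
    return {
        "cb": m.group("cb"),
        "umi": m.group("umi"),
        # Assumed layout: R1 value first, R2 value second.
        "cb_parts": m.group("cb").split("-"),
        "umi_parts": m.group("umi").split("-"),
    }


name = ("@A00123:45:H3F7MDSX2:1:1101:10000:1000"
        "::[GGGG]-[CCCCCC]::[AAAA]-[TTTTTT]::CB=GGGG-AAAA::UMI=CCCCCC-TTTTTT")
print(parse_tags(name))
```

A read name without the annotation, such as the unprocessed input header, yields `None`, which makes it straightforward to detect reads that were never processed by the command.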
The command generates the following output files in the specified
OUT_DIR:
OUT_DIR/{name1}-{name2}_R1.trimmed.fastq.gz
OUT_DIR/{name1}-{name2}_R2.trimmed.fastq.gz

# 1) Regular Hiplex CUT&Tag
multiEpiPrep adapt \
-i ./demux \
-o ./adapter \
--linker GCGATCGAGGACGGCAGATGTGTATAAGAGACAG,CACCGTCTCCGCCTCAGATGTGTATAAGAGACAG
# 2) UMI-containing Hiplex CUT&Tag
multiEpiPrep adapt \
-i ./demux \
-o ./adapter \
--sp 8 \
--umi 8 \
--linker GCGATCGAGGACGGCAGATGTGTATAAGAGACAG,CACCGTCTCCGCCTCAGATGTGTATAAGAGACAG
# 3) Candidate CB sequence list
multiEpiPrep adapt \
-i ./demux \
-o ./adapter \
--cb AAAA,CCCC,GGGG,TTTT \
--umi 8 \
--linker GCGATCGAGGACGGCAGATGTGTATAAGAGACAG,CACCGTCTCCGCCTCAGATGTGTATAAGAGACAG
# 4) Strict mode
multiEpiPrep adapt \
-i ./demux \
-o ./adapter \
-r 0 \
-l 0
# Test data
DEMUX_DIR="./demux"
TRIM_DIR="./trim"
for d in "$DEMUX_DIR"/*/; do
[ -d "$d" ] || continue
sample=$(basename "$d")
echo "======================"
echo "$sample"
echo "======================"
out="${TRIM_DIR}/${sample}"
multiEpiPrep adapt -i "$d" -o "$out" -g hg38 --linker GCGATCGAGGACGGCAGATGTGTATAAGAGACAG,CACCGTCTCCGCCTCAGATGTGTATAAGAGACAG
done