This README is the developer-facing reference for the full PepSeqPred training and evaluation pipeline.
For lightweight inference usage and API quickstart, use README.pypi.md.
PepSeqPred supports two usage profiles:
- PyPI quickstart profile (`pip install pepseqpred`): user-facing inference API with bundled pretrained artifacts and artifact-path inference helpers.
- Repository developer profile (`pip install -e .[dev]`): full source tree for preprocessing, embeddings, label generation, model-head training, Optuna tuning, prediction, evaluation, and HPC orchestration.
The repository profile is the source of truth for reproducing experiments end-to-end.
| Path | Purpose |
|---|---|
| `src/pepseqpred/apps/` | CLI entrypoints for each pipeline stage |
| `src/pepseqpred/core/preprocess/` | Metadata and z-score preprocessing |
| `src/pepseqpred/core/embeddings/` | ESM-2 sequence embedding generation |
| `src/pepseqpred/core/labels/` | Residue-level label construction |
| `src/pepseqpred/core/data/` | Iterable dataset and windowing/padding logic |
| `src/pepseqpred/core/models/` | FFNN model definitions |
| `src/pepseqpred/core/train/` | DDP, splitting, metrics, thresholds, trainer, seeds, weights |
| `src/pepseqpred/core/predict/` | Checkpoint/manifest resolution and inference logic |
| `src/pepseqpred/core/io/` | FASTA/TSV readers, key parsing, logging, CSV appends |
| `src/pepseqpred/api/` | Stable Python inference API and pretrained registry |
| `scripts/hpc/` | SLURM wrappers for each production stage |
| `scripts/tools/` | Zipapp build tools and Cocci eval prep/compare tooling |
| `tests/` | Unit, integration, and e2e coverage |
| `envs/` | Conda environment specs for local and HPC |
Stage 1 normalize dataset inputs (PV1/CWP/BKP) to a shared training contract
Stage 2 generate ESM-2 per-residue embeddings
Stage 3 build residue-level label shards
Stage 4 train model head (unified n-fold interface, DDP-aware)
Stage 5 optional Optuna tuning (DDP-aware)
Stage 6 predict residue masks from checkpoint/manifest
Stage 7 evaluate residue metrics (+ optional Cocci peptide compare)
*Note: `data/sample/` contains samples of the expected dataset formats.*
CLI: pepseqpred-prepare-dataset (src/pepseqpred/apps/prepare_dataset_cli.py)
This stage is the recommended entrypoint when training on one or more of:
- PV1 (human virome)
- CWP/Cocci (fungal)
- BKP (bacterial)
It normalizes source-specific metadata and FASTA headers into a shared PV1-compatible contract (i.e., ID= AC= OXX=) so downstream embedding, label generation, and training CLIs can be reused unchanged.
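To make the shared contract concrete, the sketch below parses a header that carries `ID=`, `AC=`, and `OXX=` fields. The whitespace-separated `key=value` layout, the sample header, and the helper name `parse_contract_header` are assumptions made for illustration; the actual layout and validation live in `preparedataset.py` and may differ.

```python
def parse_contract_header(header: str) -> dict[str, str]:
    """Parse 'key=value' tokens from a FASTA header line (assumed layout)."""
    fields: dict[str, str] = {}
    for token in header.lstrip(">").split():
        key, sep, value = token.partition("=")
        if sep:  # keep only tokens that actually contain '='
            fields[key] = value
    missing = {"ID", "AC", "OXX"} - fields.keys()
    if missing:
        raise ValueError(f"header is missing fields: {sorted(missing)}")
    return fields


# Hypothetical header; real values come from prepared_targets.fasta.
example = ">ID=prot_0001 AC=ABC123 OXX=11,22,33,44"
print(parse_contract_header(example))
```

A check like this can be handy as a quick sanity pass over `prepared_targets.fasta` before launching embedding jobs.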
Core module
src/pepseqpred/core/preprocess/preparedataset.py
Required output contract per dataset
- `prepared_targets.fasta`
- `prepared_labels_metadata.tsv`
- `prepared_embedding_metadata.tsv`
- `prepare_summary.json`
PV1 inputs and command
- metadata TSV
- z-score TSV
- protein FASTA
pepseqpred-prepare-dataset \
data/PV1/PV1_meta_2020-11-23_cleaned.tsv \
data/PV1/prepared \
--dataset-kind pv1 \
--protein-fasta data/PV1/PV1_targets.fasta \
--z-file data/PV1/PV1_zscores.tsv

CWP/Cocci inputs and command
- metadata TSV
- protein FASTA
- reactive code list TSV
- non-reactive code list TSV
pepseqpred-prepare-dataset \
data/Cocci/CWP_metadata.tsv \
data/Cocci/prepared \
--dataset-kind cwp \
--protein-fasta data/Cocci/CWP_targets.faa \
--reactive-codes data/Cocci/CWP_reactive_Z20N4.tsv \
--nonreactive-codes data/Cocci/CWP_nonreactive_Z20N4.tsv

BKP inputs and command
- metadata TSV
- protein FASTA
- reactive code list TSV
- non-reactive code list TSV
pepseqpred-prepare-dataset \
data/BKP/BKP_metadata.tsv \
data/BKP/prepared \
--dataset-kind bkp \
--protein-fasta data/BKP/BKP.faa \
--reactive-codes data/BKP/BKP_reactive_Z20N4.tsv \
--nonreactive-codes data/BKP/BKP_nonreactive_Z20N4.tsv

Dataset-specific grouping used for leakage-aware splitting (`--split-type id-family`)
- PV1: family from PV1 `OXX`
- CWP/Cocci: `Cluster50ID` mapped to deterministic numeric IDs
- BKP: `reClusterID_70` mapped to deterministic numeric IDs
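The grouping above can be pictured with a small sketch: cluster names are mapped to numeric IDs in sorted order (one way to make the mapping deterministic), and a holdout split then keeps every group entirely on one side. Both helper names, the sorted-order mapping, and the rounding rule for the validation share are assumptions; the repository's split logic in `core/train/split.py` may work differently.

```python
import random
from collections import defaultdict


def assign_group_ids(cluster_names: list[str]) -> dict[str, int]:
    """Map cluster names to numeric IDs; sorting makes the mapping deterministic."""
    return {name: idx for idx, name in enumerate(sorted(set(cluster_names)))}


def group_holdout_split(protein_to_group: dict[str, int], val_frac: float, seed: int):
    """Split proteins so every group lands entirely in train or in validation."""
    groups = defaultdict(list)
    for protein_id, group_id in protein_to_group.items():
        groups[group_id].append(protein_id)
    group_ids = sorted(groups)
    random.Random(seed).shuffle(group_ids)  # seeded, so the split is repeatable
    n_val = max(1, round(len(group_ids) * val_frac))
    val_groups = set(group_ids[:n_val])
    train = sorted(p for g in group_ids if g not in val_groups for p in groups[g])
    val = sorted(p for g in val_groups for p in groups[g])
    return train, val


# Hypothetical cluster labels standing in for Cluster50ID / reClusterID_70 values.
clusters = {"p1": "c_b", "p2": "c_a", "p3": "c_b", "p4": "c_c"}
ids = assign_group_ids(list(clusters.values()))
train, val = group_holdout_split({p: ids[c] for p, c in clusters.items()}, 0.34, seed=0)
```

Splitting on groups instead of individual proteins is what prevents near-identical sequences from appearing on both sides of the split.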
Next stages after prepare
- run `pepseqpred-esm` with `--embedding-key-mode id-family` and each dataset's `prepared_embedding_metadata.tsv`
- run `pepseqpred-labels` with `--embedding-key-delim -`
- train with `--split-type id-family`
CLI: pepseqpred-preprocess (src/pepseqpred/apps/preprocess_cli.py)
Inputs
- metadata TSV
- z-score TSV
Core modules
- `core/preprocess/pv1.py`
- `core/preprocess/zscores.py`
- `core/io/read.py`
Command
pepseqpred-preprocess data/meta.tsv data/zscores.tsv --save

Output
- training-ready metadata TSV with `Def epitope`, `Uncertain`, `Not epitope`
- default filename pattern: `input_data_<is_epi_z>_<is_epi_min_subs>_<not_epi_z>_<not_epi_max_subs|all>.tsv`
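As a reading aid for the filename pattern, the sketch below assembles a name from the four threshold values. The function name and the use of `None` to stand for the `all` case are assumptions, as is the plain `str()` formatting of each value; the CLI may format non-integer thresholds differently.

```python
def default_output_name(is_epi_z, is_epi_min_subs, not_epi_z, not_epi_max_subs=None) -> str:
    """Build the default filename; a missing last field is rendered as 'all'."""
    last = "all" if not_epi_max_subs is None else str(not_epi_max_subs)
    return f"input_data_{is_epi_z}_{is_epi_min_subs}_{not_epi_z}_{last}.tsv"


print(default_output_name(20, 4, 10))  # input_data_20_4_10_all.tsv
```

The printed name matches the `input_data_20_4_10_all.tsv` file used in the label-generation command of this README.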
CLI: pepseqpred-esm (src/pepseqpred/apps/esm_cli.py)
Inputs
- FASTA file
- optional metadata TSV for `id-family` naming mode
Core modules
- `core/embeddings/esm2.py`
- `core/io/read.py`
- `core/io/keys.py`
Command
pepseqpred-esm \
--fasta-file data/targets.fasta \
--out-dir data/esm2 \
--embedding-key-mode id-family \
--key-delimiter - \
--model-name esm2_t33_650M_UR50D \
--max-tokens 1022 \
--batch-size 8

Output
- per-protein embedding files under `<out-dir>/artifacts/pts/*.pt`
- embedding index CSV under `<out-dir>/artifacts/*.csv`
- optional shard-specific outputs when `--num-shards > 1`
Length feature note:
- `--seq-len-feature {none,raw,inverse}` controls whether a sequence-length scalar is appended to every residue embedding.
- The default is `none`. `raw` appends `float(seq_len)`; `inverse` appends `1.0 / seq_len`.
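The three modes can be illustrated on a toy embedding. The sketch uses plain nested lists in place of tensors so it runs without PyTorch, and the helper name `append_seq_len_feature` is hypothetical; the real logic lives in `core/embeddings/esm2.py`.

```python
def append_seq_len_feature(embedding: list[list[float]], mode: str = "none") -> list[list[float]]:
    """Return a copy of an (L, D) embedding with an optional length column appended."""
    if mode == "none":
        return [row[:] for row in embedding]
    seq_len = len(embedding)  # L: one row per residue
    if mode == "raw":
        value = float(seq_len)
    elif mode == "inverse":
        value = 1.0 / seq_len
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    # The same scalar is appended to every residue row, giving shape (L, D + 1).
    return [row + [value] for row in embedding]


toy = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]  # L = 4, D = 2
print(append_seq_len_feature(toy, "raw")[0])      # [0.1, 0.2, 4.0]
print(append_seq_len_feature(toy, "inverse")[0])  # [0.1, 0.2, 0.25]
```

Because the extra column changes the input width from `D` to `D + 1`, the same mode has to be used consistently at training and prediction time.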
CLI: pepseqpred-labels (src/pepseqpred/apps/labels_cli.py)
Inputs
- preprocessed metadata TSV
- one or more embedding directories
Core module
core/labels/builder.py
Command
pepseqpred-labels \
data/input_data_20_4_10_all.tsv \
data/labels/labels_shard_000.pt \
--emb-dir data/esm2/artifacts/pts/shard_000 \
--restrict-to-embeddings \
--calc-pos-weight \
--embedding-key-delim -

Output
- label shard `.pt` with protein label tensors and peptide metadata
- optional `class_stats` payload when `--calc-pos-weight` is enabled
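For orientation, the sketch below shows one common way a `class_stats` payload with `pos_count`, `neg_count`, and `pos_weight` could be derived. It assumes 1-D binary labels per protein and the conventional `pos_weight = neg_count / pos_count` ratio used with weighted binary cross-entropy; how `core/labels/builder.py` treats uncertain residues or `(L, 3)` labels is not covered here and may differ.

```python
def class_stats(label_rows: list[list[int]]) -> dict[str, float]:
    """Count positive/negative residues and derive pos_weight = neg_count / pos_count."""
    pos = sum(label == 1 for row in label_rows for label in row)
    neg = sum(label == 0 for row in label_rows for label in row)
    if pos == 0:
        raise ValueError("no positive residues; pos_weight is undefined")
    return {"pos_count": pos, "neg_count": neg, "pos_weight": neg / pos}


# Two hypothetical proteins with per-residue binary labels.
print(class_stats([[1, 0, 0], [0, 0, 1, 0]]))
# {'pos_count': 2, 'neg_count': 5, 'pos_weight': 2.5}
```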
CLI: pepseqpred-train (src/pepseqpred/apps/train_cli.py)
Unified run interface
- `--n-folds 1`: one holdout run per split/train seed pair (uses `--val-frac`)
- `--n-folds K` (`K > 1`): K-fold members per split/train seed pair set
- `--split-seeds` and `--train-seeds` are paired by index; if both are omitted, both default to `--seed`
- `--model-head ffnn` is the default and preserves existing dense-head behavior
- `--model-head conv1d` adds a local Conv1d feature stack before the dense residue classifier
- `--seq-len-feature {none,raw,inverse}` records whether embeddings include an appended sequence-length feature; use the same mode used during embedding generation
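The seed pairing and fold expansion described above can be sketched as a small run planner. The function name, the `pair_index` and `fold` keys, and the choice to reject unpaired or unequal-length seed lists are assumptions for illustration; the actual planner in the training CLI may handle those cases differently.

```python
def plan_runs(n_folds: int, seed: int, split_seeds=None, train_seeds=None) -> list[dict]:
    """Expand index-paired seeds into one entry per run member."""
    if split_seeds is None and train_seeds is None:
        # Both omitted: both fall back to the single --seed value.
        split_seeds, train_seeds = [seed], [seed]
    if split_seeds is None or train_seeds is None or len(split_seeds) != len(train_seeds):
        # Assumption: this sketch refuses seed lists that cannot be paired by index.
        raise ValueError("split seeds and train seeds must be equal-length lists")
    runs = []
    for pair_index, (split_seed, train_seed) in enumerate(zip(split_seeds, train_seeds)):
        for fold in range(n_folds):  # n_folds == 1 yields a single holdout run
            runs.append({"pair_index": pair_index, "fold": fold,
                         "split_seed": split_seed, "train_seed": train_seed})
    return runs


print(len(plan_runs(n_folds=1, seed=42)))                                    # 1
print(len(plan_runs(n_folds=3, seed=0, split_seeds=[1, 2], train_seeds=[10, 20])))  # 6
```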
Core modules
- `core/data/proteindataset.py`
- `core/models/ffnn.py`
- `core/train/{trainer,split,ddp,metrics,threshold,weights,seed,embedding}.py`
Command (smoke)
pepseqpred-train \
--embedding-dirs data/esm2/artifacts/pts/shard_000 \
--label-shards data/labels/labels_shard_000.pt \
--epochs 1 \
--model-head ffnn \
--subset 100 \
--save-path data/models/ffnn_smoke \
--results-csv data/models/ffnn_smoke/runs.csv

For a local sequence head, use `--model-head conv1d` and optionally tune `--conv-channels`, `--conv-layers`, `--conv-kernel-size`, and `--conv-dropout`.
Submit one SLURM training job with multiple datasets (PV1 + CWP + BKP)
scripts/hpc/train.sh accepts multiple embedding directories and multiple label shards in one call:
- all embedding dirs first
- separator `--`
- all label shard `.pt` files after `--`
# Example: use per-dataset shard outputs together in one training run
EMB_DIRS=(
/scratch/$USER/esm2/pv1/artifacts/pts/shard_000
/scratch/$USER/esm2/pv1/artifacts/pts/shard_001
/scratch/$USER/esm2/pv1/artifacts/pts/shard_002
/scratch/$USER/esm2/pv1/artifacts/pts/shard_003
/scratch/$USER/esm2/cwp/artifacts/pts/shard_000
/scratch/$USER/esm2/cwp/artifacts/pts/shard_001
/scratch/$USER/esm2/cwp/artifacts/pts/shard_002
/scratch/$USER/esm2/cwp/artifacts/pts/shard_003
/scratch/$USER/esm2/bkp/artifacts/pts/shard_000
/scratch/$USER/esm2/bkp/artifacts/pts/shard_001
/scratch/$USER/esm2/bkp/artifacts/pts/shard_002
/scratch/$USER/esm2/bkp/artifacts/pts/shard_003
)
LABEL_SHARDS=(
/scratch/$USER/labels/pv1/labels_shard_000.pt
/scratch/$USER/labels/pv1/labels_shard_001.pt
/scratch/$USER/labels/pv1/labels_shard_002.pt
/scratch/$USER/labels/pv1/labels_shard_003.pt
/scratch/$USER/labels/cwp/labels_shard_000.pt
/scratch/$USER/labels/cwp/labels_shard_001.pt
/scratch/$USER/labels/cwp/labels_shard_002.pt
/scratch/$USER/labels/cwp/labels_shard_003.pt
/scratch/$USER/labels/bkp/labels_shard_000.pt
/scratch/$USER/labels/bkp/labels_shard_001.pt
/scratch/$USER/labels/bkp/labels_shard_002.pt
/scratch/$USER/labels/bkp/labels_shard_003.pt
)
sbatch train.sh "${EMB_DIRS[@]}" -- "${LABEL_SHARDS[@]}"

Notes:
- Keep `SPLIT_TYPE=id-family` for family-aware leakage control across PV1/CWP/BKP.
- Protein IDs should be globally unique across all provided label shards/embedding dirs.
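The `--` separator convention and the uniqueness requirement can both be checked before submitting a job. The sketch below is a Python illustration only: `train.sh` itself is a shell script, and the helper names and the way protein IDs are collected per shard are assumptions.

```python
from collections import Counter


def split_train_args(argv: list[str]) -> tuple[list[str], list[str]]:
    """Split wrapper arguments into (embedding dirs, label shards) at the first '--'."""
    if "--" not in argv:
        raise ValueError("expected a '--' separator between embedding dirs and label shards")
    sep = argv.index("--")
    emb_dirs, label_shards = argv[:sep], argv[sep + 1:]
    if not emb_dirs or not label_shards:
        raise ValueError("need at least one embedding dir and one label shard")
    return emb_dirs, label_shards


def duplicate_protein_ids(ids_per_shard: dict[str, list[str]]) -> list[str]:
    """Return protein IDs that occur more than once across all shards."""
    counts = Counter(pid for ids in ids_per_shard.values() for pid in ids)
    return sorted(pid for pid, n in counts.items() if n > 1)


emb, shards = split_train_args(["pv1/shard_000", "cwp/shard_000", "--", "pv1.pt", "cwp.pt"])
print(emb, shards)
print(duplicate_protein_ids({"pv1.pt": ["A", "B"], "cwp.pt": ["B", "C"]}))  # ['B']
```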
Outputs
- run checkpoint(s), usually `fully_connected.pt`
- per-run CSV (`runs.csv` or `multi_run_results.csv`)
- aggregate `multi_run_summary.json`
- ensemble manifest JSON when `--n-folds > 1`
CLI: pepseqpred-train-optuna (src/pepseqpred/apps/train_optuna_cli.py)
Core modules
- same data/model/train stack as Stage 4
- Optuna trial orchestration in app layer
Command (smoke)
pepseqpred-train-optuna \
--embedding-dirs data/esm2/artifacts/pts/shard_000 \
--label-shards data/labels/labels_shard_000.pt \
--n-trials 2 \
--epochs 1 \
--save-path data/models/optuna_smoke \
--csv-path data/models/optuna_smoke/trials.csv

Current Optuna search space
Optuna is fixed-head per study. --model-head ffnn samples the dense-head space. --model-head conv1d samples the same dense/optimizer settings plus convolutional head settings.
| Hyperparameter (`best_params` key) | Type | Search space (current implementation) | Controlled by |
|---|---|---|---|
| `model_head` | fixed | `ffnn` or `conv1d` | `--model-head` |
| `depth` | integer | `[depth_min, depth_max]` | `--depth-min`, `--depth-max` |
| `width_step` | categorical | `{16, 32, 64}` | fixed in code |
| `base_width` | integer | `[width_min, width_max]` with `step=width_step` | `--width-min`, `--width-max` |
| `shape_ratio` | float | `[0.60, 0.95]` | sampled only when `--arch-mode` is `bottleneck` or `pyramid` |
| `dropout` | float | `[0.00, 0.25]` | fixed in code |
| `use_layer_norm` | categorical | `{True, False}` | fixed in code |
| `use_residual` | categorical | `{True, False}` | fixed in code |
| `conv_channels` | categorical | values from `--conv-channel-choices` | `conv1d` only |
| `conv_layers` | integer | `[conv_layers_min, conv_layers_max]` | `conv1d` only |
| `conv_kernel_size` | categorical | odd values from `--conv-kernel-size-choices` | `conv1d` only |
| `conv_dropout` | float | `[conv_dropout_min, conv_dropout_max]` | `conv1d` only |
| `learning_rate` | float (log) | `[lr_min, lr_max]` | `--lr-min`, `--lr-max` |
| `weight_decay` | float (log) | `[wd_min, wd_max]` | `--wd-min`, `--wd-max` |
| `batch_size` | categorical | values from `--batch-sizes` CSV | `--batch-sizes` |
Architecture shaping behavior:
- `--arch-mode flat`: hidden widths are `[base_width] * depth`
- `--arch-mode bottleneck`: widths decrease by `shape_ratio` across layers
- `--arch-mode pyramid`: widths increase by `shape_ratio` across layers
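One possible reading of the three modes is sketched below. Only the `flat` rule is stated exactly in this README; the geometric `base_width * shape_ratio ** i` schedule for `bottleneck`, the choice to model `pyramid` as the reverse of that schedule, and the rounding are assumptions, so the real width lists produced by the tuner may differ.

```python
def hidden_widths(arch_mode: str, base_width: int, depth: int, shape_ratio: float = 1.0) -> list[int]:
    """Return one hidden width per layer for the given architecture mode."""
    if arch_mode == "flat":
        return [base_width] * depth
    # Assumed schedule: each layer is shape_ratio times the previous one.
    shrinking = [max(1, round(base_width * shape_ratio ** i)) for i in range(depth)]
    if arch_mode == "bottleneck":
        return shrinking
    if arch_mode == "pyramid":
        return shrinking[::-1]  # assumed: same widths in increasing order
    raise ValueError(f"unknown arch mode: {arch_mode!r}")


print(hidden_widths("flat", 128, 3))              # [128, 128, 128]
print(hidden_widths("bottleneck", 128, 3, 0.75))  # [128, 96, 72]
print(hidden_widths("pyramid", 128, 3, 0.75))     # [72, 96, 128]
```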
Not tuned by Optuna in the current setup:
- `pos_weight` (fixed for the study via `--pos-weight`, or computed once from label shards if omitted)
- split strategy and validation fraction (`--split-type`, `--val-frac`)
- sequence windowing (`--window-size`, `--stride`) and data-loader behavior
- sequence-length feature mode (`--seq-len-feature`; defaults to `none`)
- trial budget/pruning controls (`--n-trials`, `--epochs`, `--pruner-warmup`, `--timeout-s`)
- optimization target metric selection (`--metric`) is user-selected, then maximized by Optuna
HPC default override note:
- The CLI default for `--batch-sizes` is `32,64,128`.
- The SLURM wrapper `scripts/hpc/trainoptuna.sh` currently overrides this to `256,512,1024` unless changed via env var.
Outputs
- trial rows CSV
- study storage (if configured)
- per-trial checkpoints under `trials/trial_*`
- `best_trial.json` and copied best checkpoint
CLI: pepseqpred-predict (src/pepseqpred/apps/prediction_cli.py)
Accepted model artifact types
- single checkpoint `.pt`
- ensemble manifest `.json` (schema v1 or v2)
Core modules
- `core/predict/artifacts.py`
- `core/predict/inference.py`
Command
pepseqpred-predict \
data/models/run_001/fully_connected.pt \
data/inference_targets.fasta \
--output-fasta data/predictions/predictions.fasta

Output
- FASTA containing binary residue masks
Length feature note:
- `--seq-len-feature auto` is the default and resolves from checkpoint `model_config.seq_len_feature`.
- A missing checkpoint key means no appended sequence-length feature. Use `--seq-len-feature raw` only when predicting with older or explicit raw-length models.
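The `auto` resolution rule can be written down in a few lines. The checkpoint is modeled as a plain dict here, and the helper name `resolve_seq_len_feature` is hypothetical; the sketch only mirrors the behavior described in the note above.

```python
def resolve_seq_len_feature(cli_value: str, checkpoint: dict) -> str:
    """Resolve the effective mode: explicit CLI values win, 'auto' reads the checkpoint."""
    if cli_value != "auto":
        return cli_value
    model_config = checkpoint.get("model_config") or {}
    # A missing key means the model was trained without an appended length feature.
    return model_config.get("seq_len_feature", "none")


print(resolve_seq_len_feature("auto", {"model_config": {}}))                              # none
print(resolve_seq_len_feature("auto", {"model_config": {"seq_len_feature": "inverse"}}))  # inverse
print(resolve_seq_len_feature("raw", {"model_config": {}}))                               # raw
```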
CLI: pepseqpred-eval-ffnn (src/pepseqpred/apps/evaluate_ffnn_cli.py)
Capabilities
- evaluate single checkpoint or ensemble manifest
- optional set auto-selection from `runs.csv`
- optional fold-level metrics and ROC/PR curves
- optional plot generation
Core modules
- `core/predict/artifacts.py`
- `core/predict/inference.py`
- `core/data/proteindataset.py`
- `core/train/metrics.py`
Command
pepseqpred-eval-ffnn \
data/models/run_001/fully_connected.pt \
--embedding-dirs data/esm2/artifacts/pts/shard_000 \
--label-shards data/labels/labels_shard_000.pt \
--output-json data/eval/ffnn_eval_summary.json

Output
- residue-level evaluation JSON
- optional fold payloads, curves, and plot files
The stable Python API is implemented in:
- `src/pepseqpred/api/predictor.py`
- `src/pepseqpred/api/pretrainedregistry.py`
- `src/pepseqpred/api/types.py`
Top-level exports (import pepseqpred):
- `load_pretrained_predictor`
- `list_pretrained_models`
- `load_predictor`
- `predict_sequence`
- `predict_fasta`
- `PepSeqPredictor`
- `PredictionResult`
Bundled pretrained registry currently includes:
- `flagship1-v1` (alias: `flagship1`)
- `flagship2-v1` (aliases: `flagship2`, `default`)
- tensor shape: `(L, D)` by default, or `(L, D+1)` when `--seq-len-feature raw` or `inverse` was used
- `L`: residue count
- optional final feature column stores either raw sequence length or inverse sequence length
{
"labels": {"<protein_id>": Tensor[(L,3)] or Tensor[(L,)]},
"proteins": {"<protein_id>": {"tax_info": {...}, "peptides": [...]}}
# optional:
"class_stats": {"pos_count": int, "neg_count": int, "pos_weight": float}
}

{
"model_state_dict": ...,
"optim_state_dict": ...,
"epoch": int,
"config": {...},
"model_config": {
# seq_len_feature is omitted for no appended length feature
# or set to "raw" / "inverse" when present
},
"best_loss": float,
"metrics": {...}
}

- schema v1: single set, `members` list
- schema v2: root `sets` list with `set_index`, each with `members`
- members are filtered by `status == "OK"` in predict/eval resolution
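Member resolution across the two manifest schemas can be sketched as follows. The `members`, `sets`, `set_index`, and `status` keys come from the description above; detecting schema v2 by the presence of `sets`, defaulting to the first set, the `k_folds` truncation argument, and the sample `fold` and `FAILED` values are assumptions for illustration.

```python
def resolve_members(manifest: dict, set_index=None, k_folds=None) -> list[dict]:
    """Return usable ensemble members from a v1 or v2 manifest."""
    if "sets" in manifest:  # assumed marker for the schema v2 layout
        sets = manifest["sets"]
        if set_index is None:
            chosen = sets[0]  # assumption: fall back to the first set
        else:
            chosen = next((s for s in sets if s["set_index"] == set_index), None)
            if chosen is None:
                raise ValueError(f"no set with set_index={set_index}")
        members = chosen["members"]
    else:  # schema v1: a single set with a top-level members list
        members = manifest["members"]
    ok = [m for m in members if m.get("status") == "OK"]
    if not ok:
        raise ValueError("manifest has no members with status OK")
    return ok if k_folds is None else ok[:k_folds]


# Hypothetical v2 manifest with one failed member.
manifest_v2 = {"sets": [{"set_index": 1, "members": [
    {"status": "OK", "fold": 0}, {"status": "FAILED", "fold": 1}, {"status": "OK", "fold": 2}]}]}
print(resolve_members(manifest_v2, set_index=1))
```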
| CLI | File | Purpose |
|---|---|---|
| `pepseqpred-prepare-dataset` | `apps/prepare_dataset_cli.py` | normalize PV1/CWP/BKP into shared training contract |
| `pepseqpred-preprocess` | `apps/preprocess_cli.py` | metadata + z-score preprocessing |
| `pepseqpred-esm` | `apps/esm_cli.py` | ESM-2 embedding generation |
| `pepseqpred-labels` | `apps/labels_cli.py` | residue label shard generation |
| `pepseqpred-train` | `apps/train_cli.py` | unified holdout/K-fold model-head training (`--n-folds`) |
| `pepseqpred-train-optuna` | `apps/train_optuna_cli.py` | fixed-head Optuna tuning |
| `pepseqpred-predict` | `apps/prediction_cli.py` | FASTA inference from checkpoint/manifest |
| `pepseqpred-eval-ffnn` | `apps/evaluate_ffnn_cli.py` | residue-level evaluation |
These wrappers are production-facing interfaces and should be treated as first-class entrypoints.
| Script | Stage | Default resources |
|---|---|---|
| `generateembeddings.sh` | Embeddings | GPU, array 0-3, a100, 2 CPU/GPU, 8G/GPU, 01:00:00 |
| `generatelabels.sh` | Labels | CPU, 1 CPU, 16G, 01:00:00 |
| `train.sh` | Train model head | GPU, 4xa100, 20 CPU, 256G, 12:00:00 |
| `trainoptuna.sh` | Optuna | GPU, 4xa100, 20 CPU, 448G, 48:00:00 |
| `predictepitope.sh` | Predict | GPU, a100, 4 CPU, 32G, 00:30:00 |
| `evaluateffnn.sh` | End-to-end eval pipeline | GPU, a100, 8 CPU, 128G, 04:00:00 |
| `evalffnnsweep.sh` | Set-indexed eval batch submitter | wrapper script (calls `evaluateffnn.sh`) |
| `preprocessdata.sh` | Preprocess helper | local helper, not a SLURM script |
- `evaluateffnn.sh` orchestrates prepare, embed, labels, predict, eval, and peptide compare stages with stage toggles (`RUN_PREP`, `RUN_EMBED`, `RUN_LABELS`, `RUN_PREDICT`, `RUN_EVAL`, `RUN_COMPARE`).
- `evaluateffnn.sh` and `evalffnnsweep.sh` depend on `scripts/tools/cocci_eval_pipeline.py`.
- HPC wrappers expect `.pyz` runtime artifacts in the working directory (for example `esm.pyz`, `train.pyz`, `predict.pyz`).
| Tool | Purpose |
|---|---|
| `buildpyz.py` | build `.pyz` runtime apps from `src/pepseqpred` |
| `pyzapps.py` | registry of app target names to module entrypoints |
| `cocci_eval_pipeline.py` | Cocci-specific eval subset prep and peptide compare |
| `rename_embeddings_id_family.py` | rename `ID.pt` to `ID-family.pt` embeddings from metadata |
Build examples:
python scripts/tools/buildpyz.py --list
python scripts/tools/buildpyz.py esm
python scripts/tools/buildpyz.py all

Test suites are organized as:
- `tests/unit/`: module-level behavior and edge cases
- `tests/integration/`: CLI-level smoke and interactions
- `tests/e2e/`: train-to-predict boundary validation
Representative coverage areas include:
- API registry and predictor behavior (`tests/unit/api/*`)
- dataset, embeddings, labels, predict, train internals (`tests/unit/core/*`)
- CLI parsers and eval selection/curve logic (`tests/unit/apps/*`)
- checkpoint/manifest prediction/evaluation smoke tests (`tests/integration/*`)
Run sequence:
ruff check .
pytest tests/unit
pytest tests/integration
pytest tests/e2e

conda env create -f envs/environment.local.yml
conda activate pepseqpred
pip install -e .[dev]

For GPU/HPC development:
conda env create -f envs/environment.hpc.yml
conda activate pepseqpred
pip install -e .[dev]

python -m venv .venv
. .venv/bin/activate # Linux/macOS
pip install --upgrade pip
pip install -e .[dev]

Windows PowerShell:
python -m venv .venv
.venv\Scripts\Activate.ps1
python -m pip install --upgrade pip
pip install -e .[dev]

When changing training or evaluation logic:
- preserve split semantics (`id` vs `id-family`)
- preserve seed handling and deterministic run planning
- avoid rank-dependent side effects in DDP code paths
- only write shared artifacts from intended rank
- avoid output path collisions between experiments
- prefer smoke tests over expensive full retraining during development
- `id-family` embedding key mode requires metadata family mapping.
- Label generation must align embedding naming with `--embedding-key-delim` (`""` for `ID.pt`, `-` for `ID-family.pt`).
- Prediction/evaluation threshold overrides must remain in `(0.0, 1.0)`.
- Ensemble manifests are resolved by valid `status=OK` members and optional `k_folds` truncation.
- For HPC pipelines, keep `.pyz` artifacts and shell scripts in the same execution directory unless intentionally using module imports.
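Two of the constraints above, embedding file naming and the threshold range, are easy to express as small checks. Both helper names are hypothetical and the sample ID and family values are made up; the sketch only restates the rules listed above and is not the repository's validation code.

```python
from typing import Optional


def embedding_filename(protein_id: str, family: Optional[str], delim: str) -> str:
    """Build an embedding filename: '' gives ID.pt, '-' gives ID-family.pt."""
    if delim == "":
        return f"{protein_id}.pt"
    if family is None:
        # id-family naming cannot work without a family value from metadata.
        raise ValueError("id-family naming requires a family value from metadata")
    return f"{protein_id}{delim}{family}.pt"


def validate_threshold(threshold: float) -> float:
    """Accept only thresholds strictly inside the open interval (0.0, 1.0)."""
    if not 0.0 < threshold < 1.0:
        raise ValueError(f"threshold must be in (0.0, 1.0), got {threshold}")
    return threshold


print(embedding_filename("prot_0001", None, ""))    # prot_0001.pt
print(embedding_filename("prot_0001", "123", "-"))  # prot_0001-123.pt
print(validate_threshold(0.5))                      # 0.5
```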
- GitHub issues: bug reports, feature requests, and development questions
- Maintainers: Jeffrey Hoelzel, Jason Ladner