PepSeqPred Developer README

This README is the developer-facing reference for the full PepSeqPred training and evaluation pipeline.

For lightweight inference usage and API quickstart, use README.pypi.md.

Scope

PepSeqPred supports two usage profiles:

PyPI quickstart profile (pip install pepseqpred): user-facing inference API with bundled pretrained artifacts and artifact-path inference helpers.
Repository developer profile (pip install -e .[dev]): full source tree for preprocessing, embeddings, label generation, model-head training, Optuna tuning, prediction, evaluation, and HPC orchestration.

The repository profile is the source of truth for reproducing experiments end-to-end.

Repository Map

| Path | Purpose |
| --- | --- |
| `src/pepseqpred/apps/` | CLI entrypoints for each pipeline stage |
| `src/pepseqpred/core/preprocess/` | Metadata and z-score preprocessing |
| `src/pepseqpred/core/embeddings/` | ESM-2 sequence embedding generation |
| `src/pepseqpred/core/labels/` | Residue-level label construction |
| `src/pepseqpred/core/data/` | Iterable dataset and windowing/padding logic |
| `src/pepseqpred/core/models/` | FFNN model definitions |
| `src/pepseqpred/core/train/` | DDP, splitting, metrics, thresholds, trainer, seeds, weights |
| `src/pepseqpred/core/predict/` | Checkpoint/manifest resolution and inference logic |
| `src/pepseqpred/core/io/` | FASTA/TSV readers, key parsing, logging, CSV appends |
| `src/pepseqpred/api/` | Stable Python inference API and pretrained registry |
| `scripts/hpc/` | SLURM wrappers for each production stage |
| `scripts/tools/` | Zipapp build tools and Cocci eval prep/compare tooling |
| `tests/` | Unit, integration, and e2e coverage |
| `envs/` | Conda environment specs for local and HPC |

End-to-End Pipeline

Stage 1  normalize dataset inputs (PV1/CWP/BKP) to a shared training contract
Stage 2  generate ESM-2 per-residue embeddings
Stage 3  build residue-level label shards
Stage 4  train model head (unified n-fold interface, DDP-aware)
Stage 5  optional Optuna tuning (DDP-aware)
Stage 6  predict residue masks from checkpoint/manifest
Stage 7  evaluate residue metrics (+ optional Cocci peptide compare)

Stage Reference

Stage 1: Multi-Dataset Prepare (PV1/CWP/BKP)

Note: data/sample/ contains small samples of the expected input format for each dataset.

CLI: pepseqpred-prepare-dataset (src/pepseqpred/apps/prepare_dataset_cli.py)

This stage is the recommended entrypoint when training on one or more of:

PV1 (human virome)
CWP/Cocci (fungal)
BKP (bacterial)

It normalizes source-specific metadata and FASTA headers into a shared PV1-compatible contract (headers carrying the ID=, AC=, and OXX= fields), so the downstream embedding, label generation, and training CLIs can be reused unchanged.
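To make the header contract concrete, here is a minimal sketch of reading the three fields back out of a prepared FASTA header. It assumes the fields are whitespace-separated `KEY=value` tokens; the example header and the `parse_header` helper are hypothetical and are not taken from the repository.

```python
def parse_header(header):
    """Parse an 'ID=... AC=... OXX=...' FASTA header into a dict.

    Assumption: fields are whitespace-separated KEY=value tokens.
    """
    fields = {}
    for token in header.lstrip(">").split():
        key, sep, value = token.partition("=")
        if sep:  # skip tokens that carry no '='
            fields[key] = value
    return fields


# Hypothetical header, for illustration only.
example_header = ">ID=PROT1 AC=A00001 OXX=1,2,3,4"
parsed = parse_header(example_header)
print(parsed)
```

A check like this can be handy in a smoke test that confirms every record in prepared_targets.fasta exposes all three fields before embeddings are generated.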

Core module

src/pepseqpred/core/preprocess/preparedataset.py

Required output contract per dataset

prepared_targets.fasta
prepared_labels_metadata.tsv
prepared_embedding_metadata.tsv
prepare_summary.json

PV1 inputs and command

metadata TSV
z-score TSV
protein FASTA

pepseqpred-prepare-dataset \
  data/PV1/PV1_meta_2020-11-23_cleaned.tsv \
  data/PV1/prepared \
  --dataset-kind pv1 \
  --protein-fasta data/PV1/PV1_targets.fasta \
  --z-file data/PV1/PV1_zscores.tsv

CWP/Cocci inputs and command

metadata TSV
protein FASTA
reactive code list TSV
non-reactive code list TSV

pepseqpred-prepare-dataset \
  data/Cocci/CWP_metadata.tsv \
  data/Cocci/prepared \
  --dataset-kind cwp \
  --protein-fasta data/Cocci/CWP_targets.faa \
  --reactive-codes data/Cocci/CWP_reactive_Z20N4.tsv \
  --nonreactive-codes data/Cocci/CWP_nonreactive_Z20N4.tsv

BKP inputs and command

metadata TSV
protein FASTA
reactive code list TSV
non-reactive code list TSV

pepseqpred-prepare-dataset \
  data/BKP/BKP_metadata.tsv \
  data/BKP/prepared \
  --dataset-kind bkp \
  --protein-fasta data/BKP/BKP.faa \
  --reactive-codes data/BKP/BKP_reactive_Z20N4.tsv \
  --nonreactive-codes data/BKP/BKP_nonreactive_Z20N4.tsv

Dataset-specific grouping used for leakage-aware splitting (--split-type id-family)

PV1: family from PV1 OXX
CWP/Cocci: Cluster50ID mapped to deterministic numeric IDs
BKP: reClusterID_70 mapped to deterministic numeric IDs
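One way to obtain a deterministic mapping from cluster labels to numeric IDs is to sort the distinct labels before numbering them, so the result does not depend on row order. The sketch below illustrates that idea only; the actual scheme in preparedataset.py may differ, and `assign_family_ids` is a hypothetical name.

```python
def assign_family_ids(cluster_labels):
    """Map each distinct cluster label to a stable integer.

    Assumption: IDs are assigned in sorted label order, starting at 1.
    """
    unique_labels = sorted(set(cluster_labels))
    return {label: idx for idx, label in enumerate(unique_labels, start=1)}


# Hypothetical cluster labels in arbitrary row order.
labels = ["clusterB", "clusterA", "clusterB", "clusterC"]
mapping = assign_family_ids(labels)
print(mapping)
```

Because the mapping depends only on the set of labels, re-running the prepare stage on shuffled metadata yields the same family IDs, which keeps id-family splits reproducible.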

Next stages after prepare

run pepseqpred-esm with --embedding-key-mode id-family and each dataset's prepared_embedding_metadata.tsv
run pepseqpred-labels with --embedding-key-delim -
train with --split-type id-family

Stage 1 (Legacy): PV1 Z-Score Preprocess

CLI: pepseqpred-preprocess (src/pepseqpred/apps/preprocess_cli.py)

Inputs

metadata TSV
z-score TSV

Core modules

core/preprocess/pv1.py
core/preprocess/zscores.py
core/io/read.py

Command

pepseqpred-preprocess data/meta.tsv data/zscores.tsv --save

Output

training-ready metadata TSV with Def epitope, Uncertain, Not epitope
default filename pattern: input_data_<is_epi_z>_<is_epi_min_subs>_<not_epi_z>_<not_epi_max_subs|all>.tsv
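The filename pattern can be read as four threshold values joined by underscores, with `all` standing in when no maximum subject count is set. A minimal sketch, assuming the values are written exactly as passed (the function name and the formatting of numbers are assumptions):

```python
def default_output_name(is_epi_z, is_epi_min_subs, not_epi_z, not_epi_max_subs=None):
    """Build the default output filename from the four thresholds.

    Assumption: a missing not_epi_max_subs is rendered as 'all'.
    """
    last = "all" if not_epi_max_subs is None else str(not_epi_max_subs)
    return f"input_data_{is_epi_z}_{is_epi_min_subs}_{not_epi_z}_{last}.tsv"


name = default_output_name(20, 4, 10)
print(name)
```

With these inputs the sketch reproduces the filename used in the Stage 3 example command, data/input_data_20_4_10_all.tsv.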

Stage 2: Generate ESM-2 Embeddings

CLI: pepseqpred-esm (src/pepseqpred/apps/esm_cli.py)

Inputs

FASTA file
optional metadata TSV for id-family naming mode

Core modules

core/embeddings/esm2.py
core/io/read.py
core/io/keys.py

Command

pepseqpred-esm \
  --fasta-file data/targets.fasta \
  --out-dir data/esm2 \
  --embedding-key-mode id-family \
  --key-delimiter - \
  --model-name esm2_t33_650M_UR50D \
  --max-tokens 1022 \
  --batch-size 8

Output

per-protein embedding files under <out-dir>/artifacts/pts/*.pt
embedding index CSV under <out-dir>/artifacts/*.csv
optional shard-specific outputs when --num-shards > 1

Length feature note:

--seq-len-feature {none,raw,inverse} controls whether a sequence-length scalar is appended to every residue embedding.
The default is none. raw appends float(seq_len); inverse appends 1.0 / seq_len.
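The effect of the three modes on an embedding of shape (L, D) can be sketched with plain Python lists. This is an illustration of the documented behavior, not the code in core/embeddings/esm2.py, which operates on tensors; `append_seq_len_feature` is a hypothetical helper.

```python
def append_seq_len_feature(embedding, mode="none"):
    """Return a copy of an (L, D) embedding with an optional length column.

    embedding: list of L rows, each a list of D floats.
    mode: 'none' keeps (L, D); 'raw' or 'inverse' yields (L, D + 1).
    """
    if mode == "none":
        return [list(row) for row in embedding]
    seq_len = len(embedding)
    if mode == "raw":
        value = float(seq_len)
    elif mode == "inverse":
        value = 1.0 / seq_len  # raises ZeroDivisionError on an empty sequence
    else:
        raise ValueError(f"unknown seq-len-feature mode: {mode!r}")
    # The same scalar is appended to every residue row.
    return [list(row) + [value] for row in embedding]


emb = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]  # L=4, D=2
raw = append_seq_len_feature(emb, "raw")
inverse = append_seq_len_feature(emb, "inverse")
print(raw[0], inverse[0])
```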

Stage 3: Build Residue Labels

CLI: pepseqpred-labels (src/pepseqpred/apps/labels_cli.py)

Inputs

preprocessed metadata TSV
one or more embedding directories

Core module

core/labels/builder.py

Command

pepseqpred-labels \
  data/input_data_20_4_10_all.tsv \
  data/labels/labels_shard_000.pt \
  --emb-dir data/esm2/artifacts/pts/shard_000 \
  --restrict-to-embeddings \
  --calc-pos-weight \
  --embedding-key-delim -

Output

label shard .pt with protein label tensors and peptide metadata
optional class_stats payload when --calc-pos-weight is enabled

Stage 4: Train Model Head

CLI: pepseqpred-train (src/pepseqpred/apps/train_cli.py)

Unified run interface

--n-folds 1: one holdout run per split/train seed pair (uses --val-frac)
--n-folds K (K > 1): one K-fold set, with K fold members, per split/train seed pair
--split-seeds and --train-seeds are paired by index; if both are omitted, both default to --seed
--model-head ffnn is the default and preserves existing dense-head behavior
--model-head conv1d adds a local Conv1d feature stack before the dense residue classifier
--seq-len-feature {none,raw,inverse} records whether embeddings include an appended sequence-length feature; use the same mode used during embedding generation
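The seed-pairing rule can be sketched as follows. Only two cases are stated above (both lists given and paired by index, or both omitted); how the CLI handles a single omitted list or mismatched lengths is not covered here, so this sketch simply rejects those cases as an assumption. `plan_seed_pairs` is a hypothetical name.

```python
def plan_seed_pairs(seed, split_seeds=None, train_seeds=None):
    """Return the (split_seed, train_seed) pairs for a run plan."""
    if split_seeds is None and train_seeds is None:
        # Both omitted: both default to --seed.
        return [(seed, seed)]
    if split_seeds is None or train_seeds is None:
        # Assumption: this sketch does not guess a fallback for one list.
        raise ValueError("provide both --split-seeds and --train-seeds, or neither")
    if len(split_seeds) != len(train_seeds):
        # Assumption: index pairing requires equal lengths.
        raise ValueError("--split-seeds and --train-seeds must have equal length")
    return list(zip(split_seeds, train_seeds))


default_plan = plan_seed_pairs(42)
paired_plan = plan_seed_pairs(42, split_seeds=[1, 2], train_seeds=[10, 20])
print(default_plan, paired_plan)
```

With --n-folds K, each pair in the plan would then expand into K fold members.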

Core modules

core/data/proteindataset.py
core/models/ffnn.py
core/train/{trainer,split,ddp,metrics,threshold,weights,seed,embedding}.py

Command (smoke)

pepseqpred-train \
  --embedding-dirs data/esm2/artifacts/pts/shard_000 \
  --label-shards data/labels/labels_shard_000.pt \
  --epochs 1 \
  --model-head ffnn \
  --subset 100 \
  --save-path data/models/ffnn_smoke \
  --results-csv data/models/ffnn_smoke/runs.csv

For a local sequence head, use --model-head conv1d and optionally tune --conv-channels, --conv-layers, --conv-kernel-size, and --conv-dropout.

Submit one SLURM training job with multiple datasets (PV1 + CWP + BKP)

scripts/hpc/train.sh accepts multiple embedding directories and multiple label shards in one call:

all embedding dirs first
separator --
all label shard .pt files after --

# Example: use per-dataset shard outputs together in one training run
EMB_DIRS=(
  /scratch/$USER/esm2/pv1/artifacts/pts/shard_000
  /scratch/$USER/esm2/pv1/artifacts/pts/shard_001
  /scratch/$USER/esm2/pv1/artifacts/pts/shard_002
  /scratch/$USER/esm2/pv1/artifacts/pts/shard_003
  /scratch/$USER/esm2/cwp/artifacts/pts/shard_000
  /scratch/$USER/esm2/cwp/artifacts/pts/shard_001
  /scratch/$USER/esm2/cwp/artifacts/pts/shard_002
  /scratch/$USER/esm2/cwp/artifacts/pts/shard_003
  /scratch/$USER/esm2/bkp/artifacts/pts/shard_000
  /scratch/$USER/esm2/bkp/artifacts/pts/shard_001
  /scratch/$USER/esm2/bkp/artifacts/pts/shard_002
  /scratch/$USER/esm2/bkp/artifacts/pts/shard_003
)

LABEL_SHARDS=(
  /scratch/$USER/labels/pv1/labels_shard_000.pt
  /scratch/$USER/labels/pv1/labels_shard_001.pt
  /scratch/$USER/labels/pv1/labels_shard_002.pt
  /scratch/$USER/labels/pv1/labels_shard_003.pt
  /scratch/$USER/labels/cwp/labels_shard_000.pt
  /scratch/$USER/labels/cwp/labels_shard_001.pt
  /scratch/$USER/labels/cwp/labels_shard_002.pt
  /scratch/$USER/labels/cwp/labels_shard_003.pt
  /scratch/$USER/labels/bkp/labels_shard_000.pt
  /scratch/$USER/labels/bkp/labels_shard_001.pt
  /scratch/$USER/labels/bkp/labels_shard_002.pt
  /scratch/$USER/labels/bkp/labels_shard_003.pt
)

sbatch train.sh "${EMB_DIRS[@]}" -- "${LABEL_SHARDS[@]}"
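The positional convention used by train.sh (embedding directories, then `--`, then label shards) can be illustrated in Python. The wrapper itself is a shell script; this sketch only mirrors the argument layout described above, and `split_train_args` is a hypothetical helper.

```python
def split_train_args(argv):
    """Split positional args at the first '--' separator."""
    if "--" not in argv:
        raise ValueError("expected '--' between embedding dirs and label shards")
    idx = argv.index("--")
    emb_dirs, label_shards = argv[:idx], argv[idx + 1:]
    if not emb_dirs or not label_shards:
        raise ValueError("need at least one embedding dir and one label shard")
    return emb_dirs, label_shards


args = [
    "esm2/pv1/artifacts/pts/shard_000",
    "esm2/cwp/artifacts/pts/shard_000",
    "--",
    "labels/pv1/labels_shard_000.pt",
    "labels/cwp/labels_shard_000.pt",
]
emb_dirs, label_shards = split_train_args(args)
print(emb_dirs, label_shards)
```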

Notes:

Keep SPLIT_TYPE=id-family for family-aware leakage control across PV1/CWP/BKP.
Protein IDs should be globally unique across all provided label shards/embedding dirs.

Outputs

run checkpoint(s), usually fully_connected.pt
per-run CSV (runs.csv or multi_run_results.csv)
aggregate multi_run_summary.json
ensemble manifest JSON when --n-folds > 1

Stage 5: Optuna Tuning (Optional)

CLI: pepseqpred-train-optuna (src/pepseqpred/apps/train_optuna_cli.py)

Core modules

same data/model/train stack as Stage 4
Optuna trial orchestration in app layer

Command (smoke)

pepseqpred-train-optuna \
  --embedding-dirs data/esm2/artifacts/pts/shard_000 \
  --label-shards data/labels/labels_shard_000.pt \
  --n-trials 2 \
  --epochs 1 \
  --save-path data/models/optuna_smoke \
  --csv-path data/models/optuna_smoke/trials.csv

Current Optuna search space

Optuna is fixed-head per study. --model-head ffnn samples the dense-head space. --model-head conv1d samples the same dense/optimizer settings plus convolutional head settings.

| Hyperparameter (`best_params` key) | Type | Search space (current implementation) | Controlled by |
| --- | --- | --- | --- |
| `model_head` | fixed | `ffnn` or `conv1d` | `--model-head` |
| `depth` | integer | `[depth_min, depth_max]` | `--depth-min`, `--depth-max` |
| `width_step` | categorical | `{16, 32, 64}` | fixed in code |
| `base_width` | integer | `[width_min, width_max]` with `step=width_step` | `--width-min`, `--width-max` |
| `shape_ratio` | float | `[0.60, 0.95]` | sampled only when `--arch-mode` is `bottleneck` or `pyramid` |
| `dropout` | float | `[0.00, 0.25]` | fixed in code |
| `use_layer_norm` | categorical | `{True, False}` | fixed in code |
| `use_residual` | categorical | `{True, False}` | fixed in code |
| `conv_channels` | categorical | values from `--conv-channel-choices` | conv1d only |
| `conv_layers` | integer | `[conv_layers_min, conv_layers_max]` | conv1d only |
| `conv_kernel_size` | categorical | odd values from `--conv-kernel-size-choices` | conv1d only |
| `conv_dropout` | float | `[conv_dropout_min, conv_dropout_max]` | conv1d only |
| `learning_rate` | float (log) | `[lr_min, lr_max]` | `--lr-min`, `--lr-max` |
| `weight_decay` | float (log) | `[wd_min, wd_max]` | `--wd-min`, `--wd-max` |
| `batch_size` | categorical | values from `--batch-sizes` CSV | `--batch-sizes` |

Architecture shaping behavior:

--arch-mode flat: hidden widths are [base_width] * depth
--arch-mode bottleneck: widths decrease by shape_ratio across layers
--arch-mode pyramid: widths increase by shape_ratio across layers
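The three shaping modes can be sketched as a function from (base_width, depth, shape_ratio) to a list of hidden widths. The flat case follows the description above directly. For the other two, this sketch assumes geometric scaling: bottleneck multiplies by shape_ratio per layer and pyramid divides by it. The rounding and the exact pyramid formula are assumptions, not the repository's implementation.

```python
def hidden_widths(base_width, depth, arch_mode="flat", shape_ratio=0.8):
    """Return the hidden-layer widths for one architecture mode."""
    if arch_mode == "flat":
        return [base_width] * depth
    if arch_mode == "bottleneck":
        # Widths shrink geometrically: base, base*r, base*r^2, ...
        return [max(1, round(base_width * shape_ratio ** i)) for i in range(depth)]
    if arch_mode == "pyramid":
        # Assumption: widths grow geometrically: base, base/r, base/r^2, ...
        return [max(1, round(base_width / shape_ratio ** i)) for i in range(depth)]
    raise ValueError(f"unknown arch mode: {arch_mode!r}")


flat = hidden_widths(256, 3, "flat")
bottleneck = hidden_widths(256, 3, "bottleneck", shape_ratio=0.75)
print(flat, bottleneck)
```

Since shape_ratio is sampled from [0.60, 0.95], a deep bottleneck narrows quickly; at the low end of the range, width falls below a quarter of base_width by the fourth layer under this assumed formula.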

Not tuned by Optuna in the current setup:

pos_weight (fixed for the study via --pos-weight, or computed once from label shards if omitted)
split strategy and validation fraction (--split-type, --val-frac)
sequence windowing (--window-size, --stride) and data-loader behavior
sequence-length feature mode (--seq-len-feature; defaults to none)
trial budget/pruning controls (--n-trials, --epochs, --pruner-warmup, --timeout-s)
optimization target metric selection (--metric) is user-selected, then maximized by Optuna

HPC default override note:

The CLI default for --batch-sizes is 32,64,128.
The SLURM wrapper scripts/hpc/trainoptuna.sh currently overrides this to 256,512,1024 unless changed via env var.

Outputs

trial rows CSV
study storage (if configured)
per-trial checkpoints under trials/trial_*
best_trial.json and copied best checkpoint

Stage 6: Predict

CLI: pepseqpred-predict (src/pepseqpred/apps/prediction_cli.py)

Accepted model artifact types

single checkpoint .pt
ensemble manifest .json (schema v1 or v2)

Core modules

core/predict/artifacts.py
core/predict/inference.py

Command

pepseqpred-predict \
  data/models/run_001/fully_connected.pt \
  data/inference_targets.fasta \
  --output-fasta data/predictions/predictions.fasta

Output

FASTA containing binary residue masks
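As an illustration of what a binary residue mask is, the sketch below thresholds per-residue probabilities into a string of 0/1 characters. The character encoding, the default threshold of 0.5, and the use of `>=` at the boundary are assumptions made for this example; the threshold range check reflects the (0.0, 1.0) constraint listed under Known Operational Notes.

```python
def probs_to_mask(probs, threshold=0.5):
    """Convert per-residue probabilities to a '0'/'1' mask string."""
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must lie strictly between 0.0 and 1.0")
    # Assumption: residues at or above the threshold count as positive.
    return "".join("1" if p >= threshold else "0" for p in probs)


mask = probs_to_mask([0.1, 0.7, 0.5, 0.49])
print(mask)
```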

Length feature note:

--seq-len-feature auto is the default and resolves from checkpoint model_config.seq_len_feature.
A missing checkpoint key means no appended sequence-length feature. Use --seq-len-feature raw only when predicting with older or explicit raw-length models.
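The auto-resolution rule can be sketched as a small lookup on the loaded checkpoint dict. The key names model_config and seq_len_feature come from the checkpoint contract in this README; the function name and the plain-dict checkpoint are illustrative assumptions.

```python
def resolve_seq_len_feature(cli_value, checkpoint):
    """Resolve the effective sequence-length feature mode for prediction."""
    if cli_value != "auto":
        # An explicit CLI value overrides whatever the checkpoint records.
        return cli_value
    model_config = checkpoint.get("model_config") or {}
    # A missing key means no appended sequence-length feature.
    return model_config.get("seq_len_feature", "none")


older_ckpt = {"model_config": {}}
raw_ckpt = {"model_config": {"seq_len_feature": "raw"}}
print(resolve_seq_len_feature("auto", older_ckpt))
print(resolve_seq_len_feature("auto", raw_ckpt))
```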

Stage 7: Evaluate

CLI: pepseqpred-eval-ffnn (src/pepseqpred/apps/evaluate_ffnn_cli.py)

Capabilities

evaluate single checkpoint or ensemble manifest
optional set auto-selection from runs.csv
optional fold-level metrics and ROC/PR curves
optional plot generation

Core modules

core/predict/artifacts.py
core/predict/inference.py
core/data/proteindataset.py
core/train/metrics.py

Command

pepseqpred-eval-ffnn \
  data/models/run_001/fully_connected.pt \
  --embedding-dirs data/esm2/artifacts/pts/shard_000 \
  --label-shards data/labels/labels_shard_000.pt \
  --output-json data/eval/ffnn_eval_summary.json

Output

residue-level evaluation JSON
optional fold payloads, curves, and plot files

Inference API (`pepseqpred.api`)

The stable Python API is implemented in:

src/pepseqpred/api/predictor.py
src/pepseqpred/api/pretrainedregistry.py
src/pepseqpred/api/types.py

Top-level exports (import pepseqpred):

load_pretrained_predictor
list_pretrained_models
load_predictor
predict_sequence
predict_fasta
PepSeqPredictor
PredictionResult

Bundled pretrained registry currently includes:

flagship1-v1 (alias: flagship1)
flagship2-v1 (aliases: flagship2, default)

Artifact Contracts

Embedding `.pt`

tensor shape: (L, D) by default, or (L, D+1) when --seq-len-feature raw or inverse was used
L: residue count
optional final feature column stores either raw sequence length or inverse sequence length

Label shard `.pt`

{
  "labels": {"<protein_id>": Tensor[(L,3)] or Tensor[(L,)]},
  "proteins": {"<protein_id>": {"tax_info": {...}, "peptides": [...]}},
  # optional, present when --calc-pos-weight was used:
  "class_stats": {"pos_count": int, "neg_count": int, "pos_weight": float}
}
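For orientation, the class_stats fields can be derived from residue labels as shown below. The ratio neg_count / pos_count is the usual convention for the pos_weight argument of a weighted binary cross-entropy loss; whether PepSeqPred computes it exactly this way, and how it treats uncertain residues, is an assumption here. The sketch uses plain lists of 0/1 labels and ignores any other value.

```python
def compute_class_stats(label_rows):
    """Compute pos/neg counts and a pos_weight from 0/1 residue labels.

    label_rows: iterable of per-protein label lists; values other than
    0 and 1 are ignored (assumption).
    """
    pos_count = sum(1 for labels in label_rows for v in labels if v == 1)
    neg_count = sum(1 for labels in label_rows for v in labels if v == 0)
    # Fall back to 1.0 when there are no positives to avoid division by zero.
    pos_weight = neg_count / pos_count if pos_count else 1.0
    return {"pos_count": pos_count, "neg_count": neg_count, "pos_weight": pos_weight}


stats = compute_class_stats([[1, 0, 0, 0], [0, 1, 0, 0]])
print(stats)
```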

Training checkpoint `.pt`

{
  "model_state_dict": ..., 
  "optim_state_dict": ..., 
  "epoch": int,
  "config": {...},
  "model_config": {
    # seq_len_feature is omitted for no appended length feature
    # or set to "raw" / "inverse" when present
  },
  "best_loss": float,
  "metrics": {...}
}

Ensemble manifest JSON

schema v1: single set, members list
schema v2: root sets list with set_index, each with members
members are filtered by status == "OK" in predict/eval resolution
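A minimal sketch of member resolution across both schema versions follows. The keys sets, set_index, members, and status come from the description above. Detecting v2 by the presence of a root sets list, defaulting to the first set, truncating to the first k_folds members, and the fold field in the example are all assumptions of this sketch rather than the logic in core/predict/artifacts.py.

```python
def resolve_members(manifest, set_index=None, k_folds=None):
    """Return the usable members of an ensemble manifest."""
    if "sets" in manifest:  # assumed marker for schema v2
        sets = manifest["sets"]
        if set_index is None:
            chosen = sets[0]  # assumption: default to the first set
        else:
            matches = [s for s in sets if s.get("set_index") == set_index]
            if not matches:
                raise KeyError(f"no set with set_index={set_index}")
            chosen = matches[0]
        members = chosen["members"]
    else:  # schema v1: single set with a members list
        members = manifest["members"]
    ok_members = [m for m in members if m.get("status") == "OK"]
    return ok_members if k_folds is None else ok_members[:k_folds]


# Hypothetical v2 manifest; "fold" and "FAILED" are illustrative values.
manifest_v2 = {
    "sets": [
        {
            "set_index": 0,
            "members": [
                {"fold": 0, "status": "OK"},
                {"fold": 1, "status": "FAILED"},
                {"fold": 2, "status": "OK"},
            ],
        }
    ]
}
usable = resolve_members(manifest_v2, set_index=0)
print([m["fold"] for m in usable])
```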

CLI Reference

| CLI | File | Purpose |
| --- | --- | --- |
| `pepseqpred-prepare-dataset` | `apps/prepare_dataset_cli.py` | normalize PV1/CWP/BKP into shared training contract |
| `pepseqpred-preprocess` | `apps/preprocess_cli.py` | metadata + z-score preprocessing |
| `pepseqpred-esm` | `apps/esm_cli.py` | ESM-2 embedding generation |
| `pepseqpred-labels` | `apps/labels_cli.py` | residue label shard generation |
| `pepseqpred-train` | `apps/train_cli.py` | unified holdout/K-fold model-head training (`--n-folds`) |
| `pepseqpred-train-optuna` | `apps/train_optuna_cli.py` | fixed-head Optuna tuning |
| `pepseqpred-predict` | `apps/prediction_cli.py` | FASTA inference from checkpoint/manifest |
| `pepseqpred-eval-ffnn` | `apps/evaluate_ffnn_cli.py` | residue-level evaluation |

HPC Script Reference (`scripts/hpc`)

These wrappers are production-facing interfaces and should be treated as first-class entrypoints.

| Script | Stage | Default resources |
| --- | --- | --- |
| `generateembeddings.sh` | Embeddings | GPU, array `0-3`, `a100`, `2` CPU/GPU, `8G`/GPU, `01:00:00` |
| `generatelabels.sh` | Labels | CPU, `1` CPU, `16G`, `01:00:00` |
| `train.sh` | Train model head | GPU, `4xa100`, `20` CPU, `256G`, `12:00:00` |
| `trainoptuna.sh` | Optuna | GPU, `4xa100`, `20` CPU, `448G`, `48:00:00` |
| `predictepitope.sh` | Predict | GPU, `a100`, `4` CPU, `32G`, `00:30:00` |
| `evaluateffnn.sh` | End-to-end eval pipeline | GPU, `a100`, `8` CPU, `128G`, `04:00:00` |
| `evalffnnsweep.sh` | Set-indexed eval batch submitter | wrapper script (calls `evaluateffnn.sh`) |
| `preprocessdata.sh` | Preprocess helper | local helper, not a SLURM script |

Important HPC notes

evaluateffnn.sh orchestrates prepare, embed, labels, predict, eval, and peptide compare stages with stage toggles (RUN_PREP, RUN_EMBED, RUN_LABELS, RUN_PREDICT, RUN_EVAL, RUN_COMPARE).
evaluateffnn.sh and evalffnnsweep.sh depend on scripts/tools/cocci_eval_pipeline.py.
HPC wrappers expect .pyz runtime artifacts in the working directory (for example esm.pyz, train.pyz, predict.pyz).

Zipapp and Tooling (`scripts/tools`)

| Tool | Purpose |
| --- | --- |
| `buildpyz.py` | build `.pyz` runtime apps from `src/pepseqpred` |
| `pyzapps.py` | registry of app target names to module entrypoints |
| `cocci_eval_pipeline.py` | Cocci-specific eval subset prep and peptide compare |
| `rename_embeddings_id_family.py` | rename `ID.pt` to `ID-family.pt` embeddings from metadata |

Build examples:

python scripts/tools/buildpyz.py --list
python scripts/tools/buildpyz.py esm
python scripts/tools/buildpyz.py all

Testing Map

Test suites are organized as:

tests/unit/: module-level behavior and edge cases
tests/integration/: CLI-level smoke and interactions
tests/e2e/: train-to-predict boundary validation

Representative coverage areas include:

API registry and predictor behavior (tests/unit/api/*)
dataset, embeddings, labels, predict, train internals (tests/unit/core/*)
CLI parsers and eval selection/curve logic (tests/unit/apps/*)
checkpoint/manifest prediction/evaluation smoke tests (tests/integration/*)

Run sequence:

ruff check .
pytest tests/unit
pytest tests/integration
pytest tests/e2e

Environment and Setup

Conda

conda env create -f envs/environment.local.yml
conda activate pepseqpred
pip install -e .[dev]

For GPU/HPC development:

conda env create -f envs/environment.hpc.yml
conda activate pepseqpred
pip install -e .[dev]

Pip + venv

python -m venv .venv
. .venv/bin/activate  # Linux/macOS
pip install --upgrade pip
pip install -e .[dev]

Windows PowerShell:

python -m venv .venv
.venv\Scripts\Activate.ps1
python -m pip install --upgrade pip
pip install -e .[dev]

Reproducibility and Safety Guardrails

When changing training or evaluation logic:

preserve split semantics (id vs id-family)
preserve seed handling and deterministic run planning
avoid rank-dependent side effects in DDP code paths
only write shared artifacts from intended rank
avoid output path collisions between experiments
prefer smoke tests over expensive full retraining during development
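The "only write shared artifacts from intended rank" guardrail is commonly implemented as a rank check before any shared write. The sketch below reads the RANK environment variable that torchrun sets for each process and treats rank 0 as the writer; the helper names and the choice of rank 0 are assumptions for illustration, not the repository's ddp.py code.

```python
import json
import os
import tempfile
from pathlib import Path


def current_rank():
    """Read the process rank from the environment (0 when not under DDP)."""
    return int(os.environ.get("RANK", "0"))


def write_summary(path, payload, rank=None):
    """Write a shared JSON artifact from rank 0 only; return True if written."""
    rank = current_rank() if rank is None else rank
    if rank != 0:
        return False  # non-writer ranks skip the write entirely
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(payload, indent=2))
    return True


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "run_a" / "multi_run_summary.json"
    skipped = write_summary(target, {"runs": 2}, rank=1)
    written = write_summary(target, {"runs": 2}, rank=0)
    saved = json.loads(target.read_text())
print(skipped, written, saved)
```

Keeping the rank check inside the write helper, rather than scattered at call sites, makes it harder to introduce a rank-dependent side effect by accident.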

Known Operational Notes

id-family embedding key mode requires metadata family mapping.
Label generation must align embedding naming with --embedding-key-delim ("" for ID.pt, - for ID-family.pt).
Prediction/evaluation threshold overrides must remain in (0.0, 1.0).
Ensemble manifests are resolved by valid status=OK members and optional k_folds truncation.
For HPC pipelines, keep .pyz artifacts and shell scripts in the same execution directory unless intentionally using module imports.
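The relationship between the key delimiter and embedding filenames noted above can be sketched as follows: an empty delimiter corresponds to ID.pt and `-` corresponds to ID-family.pt. The helper name and the example ID and family values are hypothetical.

```python
def embedding_filename(protein_id, family=None, delim="-"):
    """Build the embedding filename that label generation should look up."""
    if delim == "":
        return f"{protein_id}.pt"  # id key mode: ID.pt
    if family is None:
        # id-family key mode needs a family from the metadata mapping.
        raise ValueError("id-family naming requires a family value")
    return f"{protein_id}{delim}{family}.pt"


plain_name = embedding_filename("PROT1", delim="")
family_name = embedding_filename("PROT1", family=123)
print(plain_name, family_name)
```

A mismatch between the delimiter used by pepseqpred-esm and the one passed to pepseqpred-labels via --embedding-key-delim would make these names disagree, so embeddings would not be found under the expected key.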

Contact

GitHub issues: bug reports, feature requests, and development questions
Maintainers: Jeffrey Hoelzel, Jason Ladner
