ProteinMPNN is a graph-based inverse folding model for fixed-backbone protein sequence design. It supports single-chain and multi-chain complexes, symmetry tying, and order-agnostic decoding. Given protein backbone coordinates in PDB format, it generates diverse amino acid sequences in ~1–2 s per 100 residues on GPU, with typical sequence recovery of ~50–55% on native and designed monomers and oligomers. The API supports batched design, temperature-controlled sampling, amino acid biases and omissions, per-residue constraints, symmetry coupling, and log-probability scores for ranking candidates in enzyme, binder, nanoparticle, and stability redesign workflows.
Generate
This endpoint generates protein sequences with ProteinMPNN.
- POST /api/v3/protein-mpnn/generate/
Generate endpoint for ProteinMPNN.
- Request Headers:
Content-Type – application/json
Authorization – Token YOUR_API_KEY
Request
params (object, required) — Configuration parameters:
temperature (float, default: 0.1) — Sampling temperature
fixed_residues (array of strings, default: []) — Residue identifiers to keep fixed
redesigned_residues (array of strings, default: []) — Residue identifiers to redesign
bias_AA (object, default: {}) — Per–amino acid bias; keys are single-letter amino acid codes, values are floats
bias_AA_per_residue (object, default: {}) — Per-residue amino acid bias; keys are residue identifiers, values are objects mapping single-letter amino acid codes to floats
omit_AA (string, default: “”) — Concatenated single-letter amino acid codes to omit globally
omit_AA_per_residue (object, default: {}) — Per-residue amino acids to omit; keys are residue identifiers, values are strings of single-letter amino acid codes
symmetry_residues (array of arrays of strings, default: []) — Groups of residue identifiers constrained together
symmetry_weights (array of arrays of floats, default: []) — Weights corresponding to symmetry_residues groups
homo_oligomer (boolean, default: False) — Homooligomer design flag
chains_to_design (array of strings, default: []) — Chain IDs to design
parse_these_chains_only (array of strings, default: []) — Chain IDs to parse from the PDB
parse_atoms_with_zero_occupancy (boolean, default: False) — Include atoms with zero occupancy
number_of_batches (int, range: 1–1, default: 1) — Number of batches
batch_size (int, range: 1–2, default: 1) — Number of designs per batch
repack_everything (boolean, optional, default: False) — Repack all side chains flag
pack_side_chains (boolean, optional, default: False) — Enable side-chain packing
number_of_packs_per_design (int, optional, range: 1–8, default: 1) — Number of side-chain packing runs per design
sc_num_samples (int, optional, range: 1–64, default: 16) — Number of side-chain samples per design
sc_num_denoising_steps (int, optional, range: 1–10, default: 3) — Number of side-chain denoising steps
force_hetatm (boolean, optional, default: False) — Treat HETATM records as ligands
pack_with_ligand_context (boolean, optional, default: True) — Include ligand context when packing side chains
fasta_seq_separation (string, default: “:”) — Separator string for FASTA sequences
file_ending (string, default: “”) — Output file suffix
zero_indexed (int, default: 0) — Residue index numbering mode flag
pdb_path (null, fixed) — Unused field
redesigned_residues_multi (null, fixed) — Unused field
fixed_residues_multi (null, fixed) — Unused field
bias_AA_per_residue_multi (null, fixed) — Unused field
omit_AA_per_residue_multi (null, fixed) — Unused field
save_stats (null, fixed) — Unused field
verbose (boolean, default: True) — Verbose logging flag
ligand_mpnn_use_side_chain_context (null, fixed) — Unused field
ligand_mpnn_use_atom_context (boolean, optional, default: True) — Ligand-aware atom context flag
ligand_mpnn_cutoff_for_score (float, optional, default: 8.0) — Distance cutoff for ligand scoring in angstroms
global_transmembrane_label (string, optional, allowed: “membrane”, “soluble”, default: “soluble”) — Global transmembrane label
transmembrane_buried (array of strings, optional, default: None) — Residue identifiers labeled as buried transmembrane
transmembrane_interface (array of strings, optional, default: None) — Residue identifiers labeled as transmembrane interface
items (array of objects, min: 1, max: 1) — Input structures:
pdb (string, required, min length: 1, max length: max_pdb_str_len) — PDB-formatted structure text
Example request:
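A minimal sketch of a request in Python, using only the standard library. The path, headers, and field names come from the listing above. The host https://biolm.ai, the placement of params and items at the top level of the JSON body, and the residue identifier format (A10 meaning chain A, residue 10) are assumptions that this page does not confirm.

```python
import json
import urllib.request

# Assumption: the host is not given on this page; replace it with your base URL.
BASE_URL = "https://biolm.ai"
ENDPOINT = "/api/v3/protein-mpnn/generate/"


def build_request(pdb_text, api_key, temperature=0.1, batch_size=1, fixed_residues=None):
    """Build (but do not send) a POST request for the generate endpoint."""
    payload = {
        "params": {
            "temperature": temperature,
            "batch_size": batch_size,   # documented range: 1-2
            "number_of_batches": 1,     # documented range: 1-1
            "fixed_residues": fixed_residues or [],
        },
        "items": [{"pdb": pdb_text}],   # exactly one structure per request
    }
    return urllib.request.Request(
        BASE_URL + ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Token {api_key}",
        },
        method="POST",
    )


def send(request):
    """Send the request and decode the JSON body (performs a network call)."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

Building the request separately from sending it makes the payload easy to inspect or log before any network call is made.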
- Status Codes:
200 OK – Successful response
400 Bad Request – Invalid input
500 Internal Server Error – Internal server error
Response
results (array of objects) — One result per input item, in the order requested:
sequence (string) — Designed amino acid sequence using 20 standard residues
pdb (string) — PDB-formatted structure corresponding to the design backbone, possibly including remarks and modified occupancies
overall_confidence (float) — Unnormalized model score for the designed sequence, higher values indicate higher confidence
ligand_confidence (float) — Unnormalized model score for ligand or interface compatibility, higher values indicate higher confidence
seq_rec (float, range: 0.0–1.0) — Sequence recovery fraction
log_probs (array of arrays of floats, shape: [L, 20]) — Per-residue log-probabilities over 20 amino acids, one inner array per residue position
sampling_probs (array of arrays of floats, shape: [L, 20]) — Per-residue sampling probabilities over 20 amino acids, normalized per position
pdb_packed (object, optional) — Packed structure outputs for side-chain models keyed by packing context name
<key> (string) — Packed structure in PDB format for the given packing context
Example response:
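The sketch below shows one way to consume a decoded response in Python. The field names come from the listing above; the example values are invented placeholders, not model output. It also assumes that each design appears as its own entry in results, which this page does not state explicitly. The helper names best_design and to_fasta are hypothetical.

```python
def best_design(response):
    """Return the result with the highest overall_confidence."""
    return max(response["results"], key=lambda r: r["overall_confidence"])


def to_fasta(results, prefix="design"):
    """Format designed sequences as FASTA, with scores in each header line."""
    lines = []
    for i, r in enumerate(results, start=1):
        header = (
            f">{prefix}_{i} "
            f"overall_confidence={r['overall_confidence']:.3f} "
            f"seq_rec={r['seq_rec']:.2f}"
        )
        lines.append(header)
        lines.append(r["sequence"])
    return "\n".join(lines)


# Illustrative stand-in for a decoded response; values are invented.
example_response = {
    "results": [
        {"sequence": "MKTAYIAK", "overall_confidence": 0.41, "seq_rec": 0.50},
        {"sequence": "MKSAYLAR", "overall_confidence": 0.47, "seq_rec": 0.38},
    ]
}

print(to_fasta(example_response["results"]))
print(best_design(example_response)["sequence"])
```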
Performance
Core design accuracy (fixed-backbone inverse folding):
- On held-out native monomer backbones, ProteinMPNN recovers ~50–52% of wild-type residues, versus ~33% for Rosetta fixed-backbone design (≈15–20 absolute percentage-point improvement), with ~200× lower per-target compute cost than the original Rosetta CPU benchmarks (1.2 s vs 258.8 s for 100 residues in the paper).
- Across 690 monomers, 732 homomers, and 98 heteromers, median sequence recovery is ~52%, ~55%, and ~51% respectively, with 90–95% recovery in buried core positions.
Multichain and symmetry-aware performance:
- Symmetry-aware decoding with logit averaging across equivalent positions improves homomer median recovery by ~2–3 percentage points over unconstrained decoding and yields higher structural consistency under AlphaFold2/ESMFold on assemblies.
- For symmetric oligomers and designed nanoparticles, experimental benchmarks show substantially higher rates of correct oligomerization and assembly than Rosetta-designed sequences on the same backbones (e.g., ~28% vs 0% correct oligomerization for C5/C6 cyclic oligomers; successful rescue of failed tetrahedral nanoparticles with crystal structures within ~1.2 Å Cα RMSD).
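The logit averaging mentioned above can be illustrated with a short Python sketch. It is a toy illustration under stated assumptions, not the model's actual decoder: the function names are hypothetical, the alphabet has three letters instead of 20, and normalizing by the sum of the weights is one reasonable reading of how symmetry_weights might be applied.

```python
import math


def tied_logits(logits_by_position, group, weights=None):
    """Weighted average of per-position logits over one group of tied positions."""
    if weights is None:
        weights = [1.0] * len(group)
    total = sum(weights)
    n_aa = len(logits_by_position[group[0]])
    return [
        sum(w * logits_by_position[pos][a] for pos, w in zip(group, weights)) / total
        for a in range(n_aa)
    ]


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    norm = sum(exps)
    return [e / norm for e in exps]


# Toy example: chains A and B disagree at residue 10; tying forces one shared choice.
logits = {"A10": [2.0, 0.0, 0.0], "B10": [0.0, 2.0, 0.0]}
shared = tied_logits(logits, ["A10", "B10"])
probs = softmax(shared, temperature=0.1)
```

Sampling a single amino acid from probs and assigning it to every position in the group is what keeps symmetric copies identical.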
Robustness, downstream foldability, and comparison to other BioLM models:
- Noise-trained ProteinMPNN variants (up to ~0.3 Å backbone noise) generate sequences that map more reliably to the target backbone under single-sequence AlphaFold2/ESMFold than un-noised models, producing ~2–3× more designs above high lDDT-Cα thresholds at the cost of a small drop in native sequence recovery.
- Relative to general inverse-folding transformers (e.g., ESM-IF1, ProstT5 Fold2AA), ProteinMPNN achieves higher native sequence recovery, especially in buried/core residues, and performs better on large or symmetric assemblies because it encodes explicit chain identity and symmetry-averaged logits.
- Compared to unconditional sequence language models used with structure-prediction filters (e.g., ESM-1v, ProGen2), ProteinMPNN typically yields 5–10× more sequences predicted (by AlphaFold2/ESMFold) to fold to the target backbone per unit of downstream structure-prediction compute.
Throughput and ranking in BioLM workflows:
- ProteinMPNN is substantially lighter than the structure-prediction models provided by BioLM (AlphaFold2, ESMFold, ProstT5 AA2Fold): redesign of a 100-residue monomer is typically >100× faster and far less GPU-intensive than a full AlphaFold2 run, and several-fold faster than an ESMFold forward pass, enabling tens to hundreds of designs per backbone before expensive validation.
- Average per-residue log-probability correlates strongly with native sequence recovery and downstream AlphaFold2/ESMFold confidence, so ProteinMPNN scores can be used to rank and prune large design sets, reducing the number of heavy structure-prediction calls needed per successful design relative to ranking solely by AlphaFold2/ESMFold metrics.
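Ranking by average per-residue log-probability, as described above, might look like the following Python sketch. The column order of log_probs is not stated on this page, so the alphabetical one-letter order used here is an assumption to verify before use; mean_log_prob and top_fraction are hypothetical helper names.

```python
# Assumption: the 20 columns of log_probs follow this alphabetical one-letter order.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def mean_log_prob(sequence, log_probs, alphabet=ALPHABET):
    """Average log-probability of the designed residue at each position."""
    if len(sequence) != len(log_probs):
        raise ValueError("sequence and log_probs must have the same length")
    index = {aa: i for i, aa in enumerate(alphabet)}
    total = sum(row[index[aa]] for aa, row in zip(sequence, log_probs))
    return total / len(sequence)


def top_fraction(results, fraction=0.25):
    """Keep the best-scoring fraction of designs (at least one), best first."""
    ranked = sorted(
        results,
        key=lambda r: mean_log_prob(r["sequence"], r["log_probs"]),
        reverse=True,
    )
    keep = max(1, int(len(ranked) * fraction))
    return ranked[:keep]
```

Only the designs returned by top_fraction would then be passed to a structure-prediction model, which is where most of the compute is spent.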
Applications
Fixed-backbone sequence design to improve stability and expression of industrial proteins and biologics from an experimentally determined or AI-predicted 3D backbone, enabling rapid redesign of poorly expressed or unstable candidates (for example, detergent-compatible proteases or Fc-fusion scaffolds) without re-running expensive structure modeling; less appropriate when no reliable backbone is available or when large backbone rearrangements are required.
Symmetry-aware sequence design for self-assembling protein nanoparticles and multimeric scaffolds, where ProteinMPNN can tie symmetric interface residues across chains via symmetry_residues to generate consistent sequences for Cn, Dn, or polyhedral assemblies, allowing vaccine and delivery-platform companies to iterate on tetrahedral or cyclic cage designs while keeping the scaffold geometry fixed.
Interface redesign for protein–protein binders and adaptor proteins by mixing fixed and designable regions on multi-chain backbones (for example, target chains in fixed_residues, binder chains in chains_to_design) to improve binding affinity, specificity, or solubility for pre-modeled complexes such as receptor–ligand or cytokine–receptor pairs; most useful when the backbone pose is reasonably correct and less suitable when interface geometry is uncertain.
Backbone-locked library generation for display technologies (phage, yeast, mRNA display), where a single validated backbone (for example, a de novo scaffold around a small-molecule pocket or a stabilized repeat protein) is resampled at higher temperature to produce diverse, structure-consistent sequences, enabling focused libraries enriched for correctly folding variants rather than random mutagenesis around one wild-type sequence.
In silico stability-focused re-encoding of existing manufacturing workhorses (such as secreted proteases, binding scaffolds, or diagnostic proteins) by redesigning non-functional regions while keeping catalytic or binding residues in fixed_residues, to improve expression yield, thermostability, or aggregation resistance prior to DNA synthesis and cell-line development; this is most effective when functional residues and backbone are well characterized and does not replace downstream developability profiling or cell-based testing.
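For the symmetry-aware use case above, the following Python sketch builds symmetry_residues and symmetry_weights for a homo-oligomer whose chains share residue numbering. The identifier format (chain letter followed by residue number, such as A10) and the equal 1/N weights are assumptions not confirmed on this page; symmetry_groups is a hypothetical helper.

```python
def symmetry_groups(chains, first_residue, last_residue, weight=None):
    """Tie residue i across all chains for every i in [first_residue, last_residue]."""
    if weight is None:
        weight = 1.0 / len(chains)  # assumption: equal weighting across chains
    residues, weights = [], []
    for number in range(first_residue, last_residue + 1):
        residues.append([f"{chain}{number}" for chain in chains])
        weights.append([weight] * len(chains))
    return residues, weights


# Example: tie residues 10-12 across the three chains of a C3 homotrimer.
residues, weights = symmetry_groups(["A", "B", "C"], 10, 12)
params = {
    "temperature": 0.1,
    "chains_to_design": ["A", "B", "C"],
    "symmetry_residues": residues,
    "symmetry_weights": weights,
}
```

Each inner list of symmetry_residues is one tied group, and the matching inner list of symmetry_weights has the same length.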
Limitations
Maximum sequence length and PDB size: ProteinMPNN only designs residues present in the input PDB and is limited to backbones where each chain has at most MPNNParams.max_sequence_len = 1024 residues. Longer chains must be truncated or split before submission. The request body enforces a single PDB per item (MPNNGenerateRequest.items has min_items=1, max_items=1) and each MPNNGenerateRequestItem has a single PDB string (pdb), so very large complexes may need to be decomposed into smaller design problems.
Batching and throughput constraints: Each request can generate a limited number of designs. params.number_of_batches is constrained to 1 <= number_of_batches <= MPNNParams.num_batches (1) and params.batch_size to 1 <= batch_size <= MPNNParams.batch_size (2), so at most two sequences are generated per request for a given PDB. This API is not intended for exhaustive large-scale sampling; for very high-throughput sequence exploration, a faster, lower-fidelity generator (e.g. language-model–based) is often better as a first pass, with ProteinMPNN used for focused redesign.
Backbone-dependence and structure quality: ProteinMPNN is a fixed-backbone inverse folding model; it assumes the input PDB backbone is close to the desired folded state and does not optimize or relax coordinates. It is robust to modest noise (trained with Gaussian backbone noise) but will faithfully encode whatever geometry is provided, including artifacts (e.g. strained loops, clashes, unrealistic hallucinations). For highly uncertain backbones or early-stage backbone exploration, upstream backbone-generative or physics-based modeling is recommended before calling this API.
What the model optimizes (and what it does not): Outputs are optimized for compatibility of amino-acid identities with the given backbone and local environment, not for binding affinity, catalytic activity, immunogenicity, or expression in a specific host. Model types such as MPNNModelTypes.SOLUBLE, MPNNModelTypes.GLOBAL_LABEL_MEMBRANE, MPNNModelTypes.PER_RESIDUE_LABEL_MEMBRANE, MPNNModelTypes.LIGAND, and MPNNModelTypes.SIDE_CHAIN bias sequences toward solubility, membrane context, ligand context, or side-chain packing, but do not replace task-specific scoring (e.g. docking, binding-energy calculations, epitope prediction). For strong functional constraints, ProteinMPNN should be one stage in a broader design–filter–rank pipeline.
Symmetry, residue specification, and interface design limits: Residue-level controls such as fixed_residues, redesigned_residues, bias_AA_per_residue, omit_AA_per_residue, symmetry_residues, transmembrane_buried, and transmembrane_interface must reference residues present in the input PDB; invalid chain IDs or out-of-range residue numbers cause validation errors. Symmetry is enforced only where explicitly specified via symmetry_residues and optional symmetry_weights; the model does not infer symmetry or assembly stoichiometry from the PDB. For complex, multi-interface designs (e.g. large heteromeric assemblies or diffuse PPI networks), additional interface-specific modeling beyond this API is typically required.
Sampling diversity and downstream compatibility: Diversity is controlled primarily via params.temperature and amino-acid bias/omit options (bias_AA, bias_AA_per_residue, omit_AA, omit_AA_per_residue). Higher temperatures increase sequence diversity but usually lower model log-probability (overall_confidence) and experimental success rates. The API does not produce embeddings or learned feature vectors; responses contain sequences (sequence) plus per-position probabilities/log-probabilities (sampling_probs, log_probs). For representation learning (clustering, visualization, downstream ML), combine ProteinMPNN with a separate embedding model (e.g. ESM, ProtT5).
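Several of the limits above can be checked on the client before a request is sent. The Python sketch below is one way to do that, not the service's own validation: it parses residues from CA atoms using the fixed-column PDB layout, assumes identifiers of the form chain letter plus residue number (such as A10), and ignores insertion codes and alternate locations. The function names are hypothetical.

```python
from collections import defaultdict

MAX_CHAIN_LEN = 1024  # per-chain limit quoted above
MAX_BATCH_SIZE = 2    # batch_size upper bound quoted above


def residues_by_chain(pdb_text):
    """Collect residue identifiers per chain from CA atoms (fixed-column PDB)."""
    chains = defaultdict(list)
    for line in pdb_text.splitlines():
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            chain = line[21]                # column 22: chain ID
            number = line[22:26].strip()    # columns 23-26: residue number
            chains[chain].append(f"{chain}{number}")
    return dict(chains)


def validate(pdb_text, params):
    """Return a list of human-readable problems; an empty list means no issue found."""
    errors = []
    chains = residues_by_chain(pdb_text)
    known = {res for ids in chains.values() for res in ids}
    for chain, ids in chains.items():
        if len(ids) > MAX_CHAIN_LEN:
            errors.append(f"chain {chain} has {len(ids)} residues (max {MAX_CHAIN_LEN})")
    if not 1 <= params.get("batch_size", 1) <= MAX_BATCH_SIZE:
        errors.append(f"batch_size must be between 1 and {MAX_BATCH_SIZE}")
    for field in ("fixed_residues", "redesigned_residues"):
        for res in params.get(field, []):
            if res not in known:
                errors.append(f"{field}: residue {res} not found in PDB")
    return errors
```

Because each request returns at most two sequences, larger design sets require repeated requests; running validate once up front avoids spending those calls on inputs that would be rejected.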
How We Use It
BioLM uses ProteinMPNN as a core sequence-design engine within iterative protein engineering workflows, where scalable APIs enable teams to move directly from structural hypotheses (experimental or predicted) to diverse sequence panels for enzymes, antibodies, membrane proteins, ligand-binding domains, and multi-component assemblies. ProteinMPNN-generated libraries feed into structure prediction (e.g., AlphaFold/RoseTTAFold), backbone-generation tools, sequence-embedding models, and downstream developability predictors so that binder design, nanoparticle assembly, interface optimization, and stability tuning can be run as end-to-end, lab-in-the-loop campaigns rather than isolated modeling steps.
ProteinMPNN designs are routinely combined with BioLM predictive models for stability, expression, solubility, immunogenicity, and biophysical properties, enabling automatic ranking and triage of candidates before synthesis.
The API-centric design supports multi-round optimization campaigns (e.g., affinity maturation, thermostability tuning, solubility rescue) by integrating with existing data-science pipelines, LIMS, and experimental feedback loops without one-off tooling for each project.
References
Dauparas, J., Anishchenko, I., Bennett, N., Bai, H., Ragotte, R. J., Milles, L. F., Wicky, B. I. M., Courbet, A., de Haas, R. J., Bethel, N., Leung, P. J. Y., Huddy, T. F., Pellock, S., Tischer, D., Chan, F., Koepnick, B., Nguyen, H., Kang, A., Sankaran, B., Bera, A. K., King, N. P., & Baker, D. (2022). Robust deep learning–based protein sequence design using ProteinMPNN. Science, 378(6615), 49–56. https://doi.org/10.1126/science.add2187
