ProstT5 Fold2AA is a GPU-accelerated bilingual protein language model that translates protein structure, encoded as Foldseek 3Di tokens (the 20 lowercase letters acdefghiklmnpqrstvwy), into amino acid sequences. The service exposes an encoder endpoint and a generator endpoint for 3Di-to-AA workflows: embedding accepts up to 16 sequences per batch, and conditional sequence generation accepts up to 2 sequences per batch (each up to 512 residues, with at most 3 samples per input). Typical use cases include structure-guided sequence design, inverse folding, remote homolog exploration, and structure-aware library generation.
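As a minimal sketch of the input convention, the Python helper below checks that a string uses only the 3Di characters listed in the request schema (acdefghiklmnpqrstvwy) and stays within a length limit. The function name `validate_3di`, its default limit, and its error messages are illustrative assumptions, not part of the API.

```python
# Hypothetical helper; the alphabet is copied from the request schema in this document.
THREE_DI_ALPHABET = frozenset("acdefghiklmnpqrstvwy")


def validate_3di(sequence: str, max_len: int = 512) -> str:
    """Return the sequence unchanged if it is a valid 3Di string, else raise ValueError."""
    if not 1 <= len(sequence) <= max_len:
        raise ValueError(f"length {len(sequence)} is outside 1..{max_len}")
    # Uppercase letters and letters outside the 20-character alphabet are rejected.
    bad = sorted(set(sequence) - THREE_DI_ALPHABET)
    if bad:
        raise ValueError(f"invalid 3Di characters: {''.join(bad)}")
    return sequence
```

Running a check like this client-side is one way to catch malformed input before it reaches the service and comes back as a 400 response.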
Encode
Generate embeddings for input fold (3Di) sequences.
- POST /api/v3/prostt5-fold2aa/encode/
Encode endpoint for ProstT5 Fold2AA.
- Request Headers:
Content-Type – application/json
Authorization – Token YOUR_API_KEY
Request
items (array of objects, min: 1, max: 16) — Input sequences:
sequence (string, min length: 1, max length: 1000, required) — Input sequence as a 3Di token string using characters “acdefghiklmnpqrstvwy”
Example request:
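A minimal Python sketch of such a request, using only the standard library. The host in `BASE_URL` is an assumption (this document states only the path), the helper names are hypothetical, and the 3Di string used in the usage note is a made-up placeholder.

```python
import json
import urllib.request

# Assumption: the host is not stated in this document; only the path is.
BASE_URL = "https://biolm.ai"
ENCODE_PATH = "/api/v3/prostt5-fold2aa/encode/"


def build_encode_request(sequences, api_key):
    """Build (but do not send) a POST request carrying one item per 3Di string."""
    payload = {"items": [{"sequence": s} for s in sequences]}
    return urllib.request.Request(
        BASE_URL + ENCODE_PATH,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Token {api_key}",
        },
        method="POST",
    )


def encode(sequences, api_key):
    """Send the request and return the decoded JSON body."""
    req = build_encode_request(sequences, api_key)
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Calling `encode(["dvvlvvcc"], "YOUR_API_KEY")` would perform the network call; building the request separately makes it possible to inspect the payload first.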
- Status Codes:
200 OK – Successful response
400 Bad Request – Invalid input
500 Internal Server Error – Internal server error
Response
results (array of objects) — One result per input item, in the order requested:
mean_representation (array of floats, size: 1024) — Mean embedding vector for the input sequence, derived from the ProstT5 encoder
Example response:
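As a hedged sketch of consuming a response with the shape described above, the snippet below pairs each input with its `mean_representation` and checks the vector size. The helper name `pair_embeddings` is hypothetical, and `fake_body` is a synthetic stand-in with made-up values, not real model output.

```python
EMBEDDING_SIZE = 1024  # size of mean_representation stated above


def pair_embeddings(sequences, body):
    """Map each input sequence to its embedding, relying on results being in request order."""
    results = body["results"]
    if len(results) != len(sequences):
        raise ValueError("expected one result per input item")
    paired = {}
    for seq, result in zip(sequences, results):
        vector = result["mean_representation"]
        if len(vector) != EMBEDDING_SIZE:
            raise ValueError(f"expected {EMBEDDING_SIZE} floats, got {len(vector)}")
        paired[seq] = vector  # note: duplicate input strings would overwrite each other
    return paired


# Synthetic stand-in for a real response body; the values are made up.
fake_body = {
    "results": [
        {"mean_representation": [0.0] * 1024},
        {"mean_representation": [0.5] * 1024},
    ]
}
embeddings = pair_embeddings(["dvvlvvcc", "dpvvlcvs"], fake_body)
```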
Generate
Generate amino acid sequences from input fold (3Di) sequences.
- POST /api/v3/prostt5-fold2aa/generate/
Generate endpoint for ProstT5 Fold2AA.
- Request Headers:
Content-Type – application/json
Authorization – Token YOUR_API_KEY
Request
The request schema lists two parameter sets, each matched to the input type of the items that follow it.
For 3Di input (the Fold2AA direction):
params (object, optional) — Configuration parameters:
temperature (float, range: 0.0-8.0, default: 1.0) — Sampling temperature for token generation
top_p (float, range: 0.0-1.0, default: 0.85) — Nucleus sampling cumulative probability threshold
top_k (int, range: 1-20, default: 3) — Number of highest probability tokens considered at each decoding step
repetition_penalty (float, range: 0.0-3.0, default: 1.2) — Penalty factor applied to repeated tokens during decoding
num_samples (int, range: 1-3, default: 1) — Number of generated sequences returned per input sequence
items (array of objects, min: 1, max: 2) — Input sequences:
sequence (string, min length: 1, max length: 512, required) — 3Di sequence using characters “acdefghiklmnpqrstvwy”
For amino acid input:
params (object, optional) — Configuration parameters:
temperature (float, range: 0.0-8.0, default: 1.2) — Sampling temperature for token generation
top_p (float, range: 0.0-1.0, default: 0.95) — Nucleus sampling cumulative probability threshold
top_k (int, range: 1-20, default: 6) — Number of highest probability tokens considered at each decoding step
repetition_penalty (float, range: 0.0-3.0, default: 1.2) — Penalty factor applied to repeated tokens during decoding
num_samples (int, range: 1-3, default: 1) — Number of generated sequences returned per input sequence
num_beams (int, range: 1-3, default: 3) — Beam search width used during decoding
items (array of objects, min: 1, max: 2) — Input sequences:
sequence (string, min length: 1, max length: 512, required) — Amino acid sequence using standard extended amino acid codes
The schema also lists item variants with the larger limits that apply to embedding (up to 16 items, sequences up to 1000 characters):
items (array of objects, min: 1, max: 16) — Input sequences:
sequence (string, min length: 1, max length: 1000, required) — Amino acid sequence using standard extended amino acid codes
items (array of objects, min: 1, max: 16) — Input sequences:
sequence (string, min length: 1, max length: 1000, required) — 3Di sequence using characters “acdefghiklmnpqrstvwy”
Example request:
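A minimal Python sketch of building a generate payload for 3Di input. The ranges are copied from the 3Di-input parameter list above; the function name `build_generate_payload` and the placeholder 3Di string are assumptions for illustration, and the service may apply additional validation of its own.

```python
import json

# Ranges copied from the 3Di-input parameter list above: (min, max).
PARAM_RANGES = {
    "temperature": (0.0, 8.0),
    "top_p": (0.0, 1.0),
    "top_k": (1, 20),
    "repetition_penalty": (0.0, 3.0),
    "num_samples": (1, 3),
}


def build_generate_payload(sequences, **params):
    """Return a JSON string for the generate endpoint, rejecting out-of-range values."""
    if not 1 <= len(sequences) <= 2:
        raise ValueError("generate accepts 1 to 2 items per request")
    for name, value in params.items():
        if name not in PARAM_RANGES:
            raise ValueError(f"unknown parameter: {name}")
        low, high = PARAM_RANGES[name]
        if not low <= value <= high:
            raise ValueError(f"{name}={value} is outside {low}..{high}")
    payload = {"items": [{"sequence": s} for s in sequences]}
    if params:
        payload["params"] = params  # omitted entirely when defaults are wanted
    return json.dumps(payload)


body = build_generate_payload(["dvvlvvcc"], temperature=1.0, top_p=0.85, num_samples=2)
```

The resulting string can be sent as the POST body with the same headers as any other request to this service.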
- Status Codes:
200 OK – Successful response
400 Bad Request – Invalid input
500 Internal Server Error – Internal server error
Response
results (array of arrays of objects) — One result per input item, in the order requested:
sequence (string) — Generated sequence; amino acids (upper-case A–Z, including X, B, Z, U, O) or 3Di tokens (lower-case acdefghiklmnpqrstvwy), length: 1–512 characters
Example response:
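As a hedged sketch of reading a response with the nested shape described above, the snippet below groups generated sequences by input index. It assumes each inner array holds one object per requested sample; the helper name is hypothetical and the sequences in `fake_body` are made up, not model output.

```python
def sequences_by_input(body):
    """Map each input index to the list of sequences generated for it."""
    return {
        i: [sample["sequence"] for sample in samples]
        for i, samples in enumerate(body["results"])
    }


# Synthetic stand-in for a response to two inputs with num_samples=2.
fake_body = {
    "results": [
        [{"sequence": "MKTAYIAK"}, {"sequence": "MRSAYLAK"}],
        [{"sequence": "GSHMLEDP"}, {"sequence": "GSQMIEDP"}],
    ]
}
designs = sequences_by_input(fake_body)
```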
Performance
Hardware and precision: inference for both encoder (embeddings) and generator (3Di→AA) runs on NVIDIA L4 GPUs in mixed half-precision (fp16), matching the ProstT5 architecture and giving similar throughput to ProtT5 for embedding-only workloads.
Inverse folding accuracy vs. ProteinMPNN: ProstT5 Fold2AA sequences evaluated with ESMFold achieve an average lDDT of ~0.72, RMSD ~2.9 Å, and TM-score ~0.58 to their target structures, compared with ProteinMPNN’s ~0.77 lDDT, 2.61 Å RMSD, and 0.61 TM-score to the same templates. This places ProstT5 slightly below ProteinMPNN in structural fidelity while using a sequence-to-sequence transformer instead of a graph neural network.
Sequence diversity and design trade-off: ProstT5 Fold2AA designs show lower pairwise identity to native sequences (~22% PIDE) than ProteinMPNN designs (~30% PIDE) for similar target structures, indicating higher sequence diversity at modest cost in lDDT/TM-score. This can be useful when exploring broader sequence space around a given fold.
Optimization relative to other BioLM models: compared with encoder-only protein LMs on BioLM (e.g., ProtT5- or ESM-2–style embedding endpoints), ProstT5 Fold2AA adds an auto-regressive decoding step and sampling controls (beam search up to width 3, top-p/top-k, temperature, repetition penalty). This makes generation slower per sequence than pure embedding extraction but substantially cheaper and faster than structure-based inverse-design workflows that require explicit 3D prediction (e.g., AlphaFold2 + ProteinMPNN).
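The pairwise identity (PIDE) figures above compare designed sequences with native ones. As a rough illustration of that quantity, the sketch below computes position-wise identity for two ungapped sequences of equal length. This is a simplifying assumption: the cited benchmark may derive PIDE from alignments, and the function name and example sequences here are made up.

```python
def percent_identity(a: str, b: str) -> float:
    """Position-wise identity (in percent) of two equal-length, ungapped sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


# 5 of 8 positions agree (M, A, Y, A, K), so the identity is 62.5 percent.
identity = percent_identity("MKTAYIAK", "MRSAYLAK")
```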
Applications
Rapid screening of large protein and metagenomic sequence collections for remote homolog detection by using ProstT5 to translate amino acid sequences into 3Di structural tokens and then searching with Foldseek; enables structure-level sensitivity at near sequence-search speed for companies mining sequence databases to discover novel protein folds or distant structural homologs; not suitable when atomic-resolution models or detailed active-site geometries are required.
High-throughput inverse folding for protein design by generating candidate amino acid sequences from target 3Di structural strings, allowing teams to propose diverse sequences expected to adopt a desired fold or scaffold before downstream structure prediction and wet-lab testing; useful for exploring backbone-compatible variants in early design cycles, but requires additional filters and experimental validation for functional activity or developability.
Structural embedding extraction from 3Di token strings with the ProstT5 encoder for fast fold-level classification, clustering, and annotation transfer (e.g. CATH/SCOPe-like grouping) across large protein libraries; valuable for biotech pipelines where sequence identity is low but structural similarity informs function or platform selection; less informative for tasks dominated by local chemistry (e.g. specific catalytic mechanisms) rather than overall fold.
Generation and ranking of structurally constrained sequence variants by iterating between 3Di representations and ProstT5-generated amino acid sequences, enabling protein engineering workflows to sample diverse sequences predicted to preserve a given structural framework for stability or solubility optimization; not optimal as a standalone tool for fine-tuning binding specificity or catalytic efficiency without complementary predictive models and assays.
Accelerated structural redundancy reduction and clustering of proprietary protein collections by first predicting 3Di strings from sequences with ProstT5 (outside this API) and then encoding or searching those 3Di strings via the API, allowing rapid removal of structurally similar candidates before expensive structure prediction or experimental screening; less appropriate for applications that depend on precise domain boundaries or quaternary structure modeling.
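The redundancy-reduction idea in the last item can be sketched as a greedy filter over embedding vectors: keep an entry only if it is not too similar to anything already kept. The cosine threshold of 0.9 is an arbitrary assumption, the function names are hypothetical, and the toy 2-dimensional vectors stand in for the 1024-dimensional embeddings returned by the encode endpoint.

```python
import math


def _cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def greedy_representatives(embeddings, threshold=0.9):
    """Return indices of items whose similarity to every kept item is below threshold."""
    kept = []
    for idx, vec in enumerate(embeddings):
        if all(_cosine(vec, embeddings[k]) < threshold for k in kept):
            kept.append(idx)
    return kept


# Toy vectors: the second is nearly parallel to the first, the third is orthogonal.
toy = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]]
representatives = greedy_representatives(toy)
```

The result depends on input order, since earlier items are always preferred as representatives; a production pipeline would likely use a dedicated clustering tool instead.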
Limitations
Maximum Sequence Length: ProstT5 supports sequences up to max_sequence_len = 1000 residues for encode requests and up to max_sequence_len = 512 residues for generate requests. Longer amino acid or 3Di sequences must be truncated or split before submission.
Batch Size: The maximum batch_size is 16 for encode and 2 for generate. Larger datasets must be processed in multiple API calls.
ProstT5 is optimized for well-structured, globular proteins. Because training filtered out low-confidence AlphaFold2 models and highly repetitive 3Di strings, performance may degrade on intrinsically disordered proteins, very short peptides, transmembrane segments with poorly defined folds, or highly repetitive 3Di sequences.
ProstT5-predicted 3Di sequences and derived embeddings are highly effective for rapid fold-based remote homology detection and structure-aware tasks, but they do not replace full 3D structure prediction or comparison methods (e.g. AlphaFold2/ESMFold plus structural alignment) when fine-grained atomic detail is required.
ProstT5 embeddings are biased toward structural information learned from 3Di; tasks dominated by functional context (e.g. detailed subcellular localization, ligand-binding specificity, GO-term prediction) may perform worse than with general-purpose protein language models such as ProtT5 used alone.
ProstT5 inverse folding (fold2AA generation) can propose diverse amino acid sequences consistent with an input 3Di fold, but sequence quality and structural fidelity are typically lower than specialized graph-based design tools (e.g. ProteinMPNN) for precise protein engineering or therapeutic design.
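The length and batch-size limits above mean larger datasets have to be split client-side. A minimal sketch, assuming a hypothetical helper `batches` and made-up placeholder sequences:

```python
def batches(sequences, batch_size, max_len):
    """Yield lists of at most batch_size sequences, rejecting any that are too long."""
    for seq in sequences:
        if len(seq) > max_len:
            raise ValueError(f"sequence of length {len(seq)} exceeds {max_len}")
    for start in range(0, len(sequences), batch_size):
        yield sequences[start:start + batch_size]


# Limits stated above: encode takes 16 items of up to 1000, generate takes 2 of up to 512.
encode_batches = list(batches(["dvvlvvcc"] * 40, batch_size=16, max_len=1000))
generate_batches = list(batches(["dvvlvvcc"] * 5, batch_size=2, max_len=512))
```

Each yielded list can then be sent as the items of one API call. This sketch rejects over-long sequences rather than truncating them, since where to cut is a modelling decision.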
How We Use It
ProstT5 Fold2AA lets us start from a target 3Di structure string and generate diverse amino acid sequences predicted to adopt that structural pattern, which is valuable for inverse folding, library design around known folds, and exploring remote sequence solutions for a given scaffold. In practice, we pair ProstT5 Fold2AA with ESMFold for rapid structural plausibility checks and Foldseek or other ProstT5-based encoders for remote homology analysis, embedding-based filtering, and prioritization in protein engineering campaigns.
Enables scalable inverse-folding libraries (3Di→AA) that can be triaged with downstream structural prediction and biophysical filters.
Integrates into broader BioLM workflows that use sequence embeddings, remote homology search, and downstream stability/activity models to focus experimental effort on the most promising designs.
References
Heinzinger, M., Weissenow, K., Gomez Sanchez, J., Henkel, A., Mirdita, M., Steinegger, M., & Rost, B. (2024). Bilingual language model for protein sequence and structure. NAR Genomics and Bioinformatics, 6(4), lqae150. https://doi.org/10.1093/nargab/lqae150
