DSM 650M Base is a 650M-parameter diffusion-based protein language model built on an ESM2-650M encoder with a masked diffusion objective for unified representation learning and generative design. The service provides GPU-accelerated endpoints for embeddings (mean, per-residue, CLS) of sequences up to 2,048 residues, log-probability and perplexity scoring, and diffusion-based sequence generation via masking or unconditional sampling. Typical applications include enzyme and binder design, fitness-guided mutational scans, and large-scale functional annotation workflows.
Predict¶
Score protein sequences with log-probability and perplexity using DSM 650M Base
- POST /api/v3/dsm-650m-base/predict/¶
Predict endpoint for DSM 650M Base.
- Request Headers:
Content-Type – application/json
Authorization – Token YOUR_API_KEY
Request
items (array of objects, min: 1, max: 16, required) — Input sequences to score:
sequence (string, min length: 1, max length: 2048, required) — Amino acid sequence to score; unambiguous amino acids only (no gaps, "<mask>", or "<eos>" tokens)
Example request:
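A minimal illustrative request body (the 33-residue sequence is a placeholder; substitute your own sequences):

```json
{
  "items": [
    {"sequence": "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"}
  ]
}
```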
- Status Codes:
200 OK – Successful response
400 Bad Request – Invalid input
500 Internal Server Error – Internal server error
Response
results (array of objects) — One result per input item, in the order requested:
log_prob (float) — Total log probability of the input sequence, natural logarithm
perplexity (float) — exp(-log_prob / L), where L is sequence_length
sequence_length (int) — Length of the scored sequence in residues, range: 1–2048
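Because the endpoint returns both quantities, the documented relationship exp(-log_prob / L) can be re-derived client-side as a sanity check. A minimal sketch with hypothetical values (not real API output):

```python
import math

def perplexity_from_log_prob(log_prob: float, sequence_length: int) -> float:
    """Recompute perplexity as exp(-log_prob / L), matching the API definition."""
    return math.exp(-log_prob / sequence_length)

# Hypothetical scoring result for a 33-residue sequence.
log_prob = -61.4
ppl = perplexity_from_log_prob(log_prob, 33)
print(round(ppl, 2))  # lower perplexity = closer to the model's learned distribution
```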
Example response:
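An illustrative response with made-up values; perplexity equals exp(-log_prob / sequence_length):

```json
{
  "results": [
    {
      "log_prob": -61.4,
      "perplexity": 6.43,
      "sequence_length": 33
    }
  ]
}
```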
Encode¶
Encode protein sequences into mean, per-residue, and CLS embeddings using DSM 650M Base
- POST /api/v3/dsm-650m-base/encode/¶
Encode endpoint for DSM 650M Base.
- Request Headers:
Content-Type – application/json
Authorization – Token YOUR_API_KEY
Request
params (object, optional) — Encoder configuration parameters:
include (array of strings, default: [“mean”]) — Embedding representations to return; allowed values: “mean”, “per_residue”, “cls”
items (array of objects, min: 1, max: 16, required) — Input protein sequences:
sequence (string, min length: 1, max length: 2048, required) — Protein sequence using AAExtendedPlusExtra alphabet with “-” allowed
Example request:
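An illustrative request body asking for mean and [CLS] embeddings for two placeholder sequences:

```json
{
  "params": {"include": ["mean", "cls"]},
  "items": [
    {"sequence": "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"},
    {"sequence": "GSHMVLSEGEWQLVLHVWAKVEAD"}
  ]
}
```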
- Status Codes:
200 OK – Successful response
400 Bad Request – Invalid input
500 Internal Server Error – Internal server error
Response
results (array of objects) — One result per input item, in the order requested:
sequence_index (int) — Zero-based index of the input sequence in the request
embeddings (array of floats, optional) — Mean-pooled embedding vector for the sequence (size: d, where d is the DSM hidden dimension for the selected model size)
per_residue_embeddings (array of arrays of floats, optional) — Per-residue embedding vectors (shape: [L, d], where L is the sequence length and d is the DSM hidden dimension for the selected model size)
cls_embeddings (array of floats, optional) — [CLS] token embedding vector (size: d, where d is the DSM hidden dimension for the selected model size)
Example response:
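An illustrative response with made-up values; vectors are truncated to three components for display, whereas real vectors have hidden_dim entries:

```json
{
  "results": [
    {
      "sequence_index": 0,
      "embeddings": [0.0213, -0.1184, 0.0441],
      "cls_embeddings": [0.1102, -0.0067, 0.0539]
    },
    {
      "sequence_index": 1,
      "embeddings": [0.0178, -0.0925, 0.0510],
      "cls_embeddings": [0.0981, 0.0123, 0.0457]
    }
  ]
}
```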
Generate¶
Diffuse and sample new protein variants from a masked seed sequence using DSM 650M Base
- POST /api/v3/dsm-650m-base/generate/¶
Generate endpoint for DSM 650M Base.
- Request Headers:
Content-Type – application/json
Authorization – Token YOUR_API_KEY
Request
params (object, optional) — Generation parameters:
num_sequences (int, range: 1-32, default: 1) — Number of sequences to generate per input item
temperature (float, range: 0.1-2.0, default: 1.0) — Sampling temperature
top_k (int, optional, min: 1, default: null) — Top-k sampling cutoff (null = disabled)
top_p (float, optional, range: 0.0-1.0, default: null) — Nucleus sampling threshold (null = disabled)
max_length (int, optional, range: 10-2048, default: null) — Maximum generated sequence length (null = inferred from input)
step_divisor (int, range: 1-1000, default: 100) — Step divisor controlling the number of diffusion steps
remasking (string, enum: [“low_confidence”, “random”, “low_logit”, “dual”], default: “random”) — Remasking strategy used during diffusion generation
items (array of objects, min: 1, max: 1) — Input specification:
sequence (string, max length: 2048, optional, default: “”) — Input protein sequence that may include “<mask>” and “<eos>” tokens; remaining characters must be unambiguous amino acids if present (empty string allowed for unconditional generation)
Example request:
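An illustrative request body; the "<mask>" tokens mark the span to be inpainted, and the surrounding residues act as the fixed template:

```json
{
  "params": {
    "num_sequences": 2,
    "temperature": 0.8,
    "step_divisor": 100,
    "remasking": "low_confidence"
  },
  "items": [
    {"sequence": "MKTAYI<mask><mask><mask><mask>SHFSRQLEERLGLIEVQ"}
  ]
}
```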
- Status Codes:
200 OK – Successful response
400 Bad Request – Invalid input
500 Internal Server Error – Internal server error
Response
results (array of arrays of objects) — One result per input item, in the order requested:
[i] (array of objects) — Generated sequences for input item i, length = num_sequences
sequence (string) — Generated amino acid sequence, length: 1–2048 residues, alphabet: unambiguous amino acids
log_prob (float) — Total log probability of sequence under the model, natural logarithm
perplexity (float, > 0.0) — exp(-log_prob / sequence_length), where sequence_length is the number of residues in sequence
sequence2 (string, optional) — Second generated amino acid sequence for PPI variants, length: 1–2048 residues, alphabet: unambiguous amino acids; omitted for base variants
Example response:
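An illustrative response for a request with num_sequences = 2; sequences and scores are made-up values:

```json
{
  "results": [
    [
      {"sequence": "MKTAYIAKQRSHFSRQLEERLGLIEVQ", "log_prob": -50.1, "perplexity": 6.40},
      {"sequence": "MKTAYIGEQRSHFSRQLEERLGLIEVQ", "log_prob": -53.8, "perplexity": 7.33}
    ]
  ]
}
```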
Performance¶
Representation quality (DSM encode) relative to other BioLM models:
- DSM 650M Base yields higher downstream linear-probe performance than its ESM-2 650M backbone across secondary structure, EC/GO/InterPro, stability, and PPI benchmarks when using frozen embeddings.
- On the same probe suite, DSM 650M Base matches or exceeds similarly sized encoder-only models such as E1 600M and ESM C 300M in weighted F1, while approaching the average classification performance of substantially larger sequence-to-sequence models like ProstT5 AA2Fold.
- Versus generative encoders such as ProGen2 Medium and ProGen2 BFD90, DSM 650M Base embeddings are more linearly predictive of functional (GO/EC) and structural labels, making it preferable when encoding or annotation quality is the primary objective.
Generative performance (DSM generate) versus other BioLM protein generators:
- Compared with ESM-2 650M used in mask-fill mode, DSM 650M Base maintains higher reconstruction F1 and Alignment Score (ASc) at high corruption (≥50–90% masked), while MLM reconstructions degrade rapidly; at 90% masking, DSM 650M Base reaches ASc ≈0.27, over four standard deviations above random natural sequence pairs.
- Relative to discrete diffusion models such as DSM 150M Base and DPLM-like approaches, DSM 650M Base matches or slightly exceeds alignment (ASc) at high mask rates but improves token-wise F1 by up to ~30–35% under heavy corruption.
- For unconditional generation, DSM 650M Base reproduces amino-acid 1–3-mer and predicted secondary-structure distributions with Jensen–Shannon divergence <0.01 to natural validation sequences, while still sampling from a distinct distribution that is useful for de novo design rather than memorization.
Scoring and calibration (DSM score) versus other encoders:
- When reconstructing masked sequences (via generate) or scoring unmasked sequences (via score), DSM 650M Base shows a slower increase in cross-entropy with mask rate than ESM-2 650M and ESM-2 3B on OMG-derived validation and test sets, so perplexity remains informative even when >70% of residues are perturbed.
- Diffusion-style training improves calibration under strong distribution shift (e.g., heavy random masking or template corruption) compared with pure MLMs, making DSM 650M Base more reliable for ranking candidates by log probability or perplexity in aggressive redesign workloads.
Comparison within the DSM family and practical implications:
- Compared to DSM 150M Base, DSM 650M Base consistently improves reconstruction ASc and F1 across 5–90% masking and better matches natural amino-acid 3-mer and 9-class secondary-structure distributions, closing much of the gap to very large encoders while remaining smaller and cheaper than models like ESM-2 3B or ProstT5 AA2Fold.
- DSM 650M Base provides more PPI-relevant signal than non-diffusion encoders of similar size on HPPI-like linear probes, but for strongly target-conditioned binder design tasks, DSM 650M PPI or Synteract2 remain preferable and can be combined with DSM 650M Base as a fast front-end generator whose outputs are filtered and ranked by structure and PPI predictors.
Applications¶
- De novo protein sequence generation for industrial enzyme leads: use DSM 650M's unconditional diffusion sampling (empty or fully masked input via the generator endpoint) to propose biomimetic but novel proteins whose amino acid and predicted secondary-structure statistics track natural proteins, reducing the need for brute-force random mutagenesis when starting new engineering campaigns for detergents, food processing, or chemical manufacturing.
- High-corruption sequence reconstruction to explore local fitness landscapes: mask large regions (e.g., 30–90%) of a known enzyme sequence with <mask> tokens and call the generator endpoint to inpaint the missing residues in a single diffusion run, enabling rapid in silico generation of diverse yet structurally coherent variant panels for stability, activity, or specificity optimization when experimental screening capacity is limited.
- Protein representation for downstream predictive models: encode sequences with the encoder endpoint (mean, per-residue, or CLS embeddings) and feed frozen embeddings into task-specific heads (e.g., thermostability, solubility, expression, or process tolerance predictors), improving performance over standard MLM encoders while keeping supervised training data and compute moderate for proprietary property predictors in commercial enzyme programs.
- Template-guided redesign of known enzyme scaffolds: use the generator endpoint to partially mask an existing catalytic or binding region in a validated protein and regenerate key segments, allowing exploration of sequence space around established motifs while tending to preserve global fold-like patterns; useful for affinity or specificity tuning, but not a replacement for 3D docking, physics-based affinity prediction, or full developability assessment.
- Likelihood-based filtering and ranking of candidate enzyme variants: score libraries with the predictor endpoint to obtain log-probability and perplexity under DSM 650M, then prioritize sequences that are closer to the learned distribution of natural proteins; this helps triage large in silico or DNA-synthesized libraries before costly wet-lab testing, while still requiring orthogonal structure/function prediction and experimental validation for final selection.
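The masking workflows above are prepared client-side before calling the API. A minimal sketch of building a high-corruption generation payload; the helper names are illustrative, not part of the API:

```python
import random

def mask_region(sequence: str, start: int, end: int) -> str:
    """Replace residues in [start, end) with "<mask>" tokens for diffusion inpainting."""
    return sequence[:start] + "<mask>" * (end - start) + sequence[end:]

def mask_random(sequence: str, fraction: float, seed: int = 0) -> str:
    """Randomly mask a fraction of residues (e.g., 0.3-0.9 for high-corruption runs)."""
    rng = random.Random(seed)
    n_mask = round(fraction * len(sequence))
    positions = set(rng.sample(range(len(sequence)), n_mask))
    return "".join("<mask>" if i in positions else aa for i, aa in enumerate(sequence))

# Placeholder 33-residue seed sequence; mask half of it, then build the request body.
seed_seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
payload = {
    "items": [{"sequence": mask_random(seed_seq, 0.5)}],
    "params": {"num_sequences": 8, "remasking": "low_confidence"},
}
print(payload["items"][0]["sequence"].count("<mask>"))  # 16 of 33 residues masked
```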
Limitations¶
Maximum sequence length and batching: DSM 650M Base accepts protein sequences up to 2048 residues on all endpoints. For DSMGenerateRequestItem.sequence, DSMEncodeRequestItem.sequence, and DSMScoreRequestItem.sequence, the max_length is 2048 (with min_length=1 where enforced). The generator endpoint (DSMGenerateRequest) currently supports a single input item per request (items has min_length=1, max_length=1) with up to num_sequences=32 outputs per item. The encoder and predictor endpoints (DSMEncodeRequest and DSMScoreRequest) accept up to 16 sequences per request (items has max_length=16). Larger workloads must be split client-side across multiple API calls.
Token and alphabet restrictions: All endpoints are protein-only. Non-amino-acid symbols (including nucleotides) are rejected. For generation, DSMGenerateRequestItem.sequence may be empty (unconditional generation) or contain standard amino acids plus "<mask>" and "<eos>"; after stripping these special tokens, any remaining characters must be valid unambiguous amino acids or the request will fail validation. For encoding, DSMEncodeRequestItem.sequence uses AAExtendedPlusExtra (extra=["-"]), allowing the extended amino acid alphabet plus the gap character "-". For scoring, DSMScoreRequestItem.sequence must contain strictly unambiguous amino acids (no gaps, no "<mask>", no "<eos>"). Mixed alphabets, masked tokens in encoding/scoring, and other symbols are not supported.
Diffusion generation behavior and quality–speed trade-offs: The generator endpoint uses a masked diffusion process, not autoregressive sampling. Sequence quality and runtime are sensitive to DSMGenerateRequestParams.step_divisor and DSMGenerateRequestParams.remasking. Lower step_divisor values (>= 1, default 100) increase the number of diffusion steps and usually improve reconstruction and realism but slow inference; higher values speed up sampling at the cost of more noise. The default remasking="random" is robust but not optimal for every domain; alternative values ("low_confidence", "low_logit", "dual") may change convergence and diversity but are not guaranteed to improve function. Aggressive sampling settings near the bounds (e.g., temperature close to 0.1 or 2.0, extreme top_k/top_p) can yield low-likelihood or unrealistic proteins even if the call is valid.
Embeddings and scores are sequence-only and context-agnostic: DSM 650M Base provides sequence-level embeddings and log-probability-based scores for single proteins only. The encoder endpoint returns representations such as embeddings (mean-pooled), per_residue_embeddings (shape [seq_len, hidden_dim]), and cls_embeddings, but these are generic features, not direct predictions of stability, activity, binding, or expression. The predictor endpoint exposes only total log_prob and perplexity per sequence. Structure, ligands, textual prompts, and partner sequences are not inputs to these Base endpoints (PPI conditioning and sequence2 outputs are only available in DSM PPI variants via different routing). Per-residue outputs are large and can be memory-intensive for long sequences and large batches.
Scientific and algorithmic limitations: DSM 650M is trained on unlabeled protein sequence corpora and evaluated mostly with in-silico metrics (reconstruction, secondary structure surrogates, annotation proxies, and binding predictors). The API does not ensure that generated proteins will express, fold, or function as intended in any organism or assay without experimental validation. Unconditional or weakly conditioned generation tends to produce "biomimetic but generic" sequences; without strong templates, targeted conditioning, or external filters, DSM is not suited to highly localized design tasks (e.g., fine active-site sculpting, antibody CDR grafting, or tight specificity changes). Its diffusion objective focuses on denoising at high mask rates with global context and is not a substitute for structure-native 3D design tools when atomic geometry or specific interfaces are the primary constraint.
When DSM 650M Base is not the optimal choice: DSM 650M Base is best used as a sequence-level generator/encoder that feeds into broader design and ranking pipelines. It is generally not the right tool for: (1) final ranking among a small set of high-value candidates, where structure- or physics-based models (e.g., structure predictors, 3D diffusion/energy models) provide better discrimination; (2) codon- or nucleotide-level design, where protein-only models cannot handle DNA/RNA constraints; (3) strict binder co-design workflows that require joint modeling of two partners, which should use DSM PPI variants or dedicated PPI/affinity models rather than the Base encoder/predictor; (4) antibody/nanobody structure prediction, where antibody-specialized 3D models typically outperform sequence-only pLMs; or (5) applications that require sequences longer than 2048 residues or extremely large single-call batches, which must instead be truncated, tiled, or split across multiple requests.
How We Use It¶
DSM 650M Base enables end-to-end protein engineering workflows where a single model supports both sequence understanding and generative design. Its diffusion-style denoising objective lets teams use the same representations for ranking, clustering, and property prediction while also exploring sequence space via masked or unconditional generation at high mask rates. In practice, DSM 650M embeddings feed into similarity search, developability and function predictors, and structure tools (e.g., AlphaFold-class models), while its generator proposes variants around natural or template sequences that can be iteratively refined with new assay data using standardized, scalable APIs.
Typical applications include enzyme improvement, antibody and binder optimization (via template- and mask-based design), and rapid exploration of mutational neighborhoods around lead sequences.
DSM 650M’s unified representation–generation capability simplifies pipeline architecture: the same model can support encode/score/generate stages, reducing integration overhead and enabling reproducible multi-round design campaigns.
References¶
Hallee, L., Rafailidis, N., Bichara, D. B., & Gleghorn, J. P. (2025). Diffusion Sequence Models for Enhanced Protein Representation and Generation. bioRxiv.
