ESM C 600M is a 600M-parameter transformer protein language model (36 layers, width 1152, 18 attention heads) trained with masked language modeling on large UniRef, MGnify, and JGI sequence datasets. The API provides GPU-accelerated encoding and masked-token prediction for batches of up to 8 protein sequences, each up to 2048 residues. It returns mean and per-token embeddings from selectable layers, as well as logits, enabling downstream use in protein design, variant effect prediction, functional annotation, and large-scale representation learning workflows.
Predict¶
Predict masked residues in input sequences
- POST /api/v3/esmc-600m/predict/¶
Predict endpoint for ESM C 600M.
- Request Headers:
Content-Type – application/json
Authorization – Token YOUR_API_KEY
Request
params (object, optional) — Configuration parameters:
repr_layers (array of integers, default: [-1]) — Indices of model layers from which to return representations
include (array of strings, default: [“mean”]) — Output types to include; allowed values: “mean”, “per_token”, “logits”
items (array of objects, min: 1, max: 8) — Input sequences; each item takes one of the following forms:
For encoding: sequence (string, min length: 1, max length: 2048, required) — Protein sequence using extended amino acid codes plus the "-" character
For masked prediction: sequence (string, min length: 1, max length: 2048, required) — Protein sequence using extended amino acid codes plus the "<mask>" token; must contain at least one "<mask>" token
For log probability calculation: sequence (string, min length: 1, max length: 2048, required) — Protein sequence using unambiguous amino acid codes only
Example request:
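A minimal sketch of a masked prediction call in Python. The base URL (https://biolm.ai) and the example sequence are assumptions for illustration; the payload follows the request schema above.

```python
import requests

URL = "https://biolm.ai/api/v3/esmc-600m/predict/"  # base URL assumed; use your deployment's host

headers = {
    "Content-Type": "application/json",
    "Authorization": "Token YOUR_API_KEY",  # replace with a valid API key
}

# One arbitrary example sequence with a single masked position.
payload = {
    "items": [
        {"sequence": "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ<mask>PILSRVGDGTQDNLSGAEK"}
    ]
}

resp = requests.post(URL, headers=headers, json=payload, timeout=60)
resp.raise_for_status()
print(resp.json())
```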
- Status Codes:
200 OK – Successful response
400 Bad Request – Invalid input
500 Internal Server Error – Internal server error
Response
results (array of objects) — One result per input item, in the order requested. Each result takes one of the following forms.
For encoding items:
embeddings (array of objects, optional) — Mean embeddings per requested layer:
layer (int) — Layer index as specified in the request
embedding (array of floats, size: 1152 for the 600M model, 960 for the 300M variant) — Mean embedding vector for the sequence
per_token_embeddings (array of objects, optional) — Per-token embeddings per requested layer:
layer (int) — Layer index as specified in the request
embeddings (array of arrays of floats, shape: [sequence_length, embedding_size]; embedding_size: 1152 for the 600M model, 960 for the 300M variant) — Embedding vectors for each token position
logits (array of arrays of floats, optional, shape: [sequence_length, vocab_size]) — Raw logits for each token position; vocab_size: 20 (the 20 standard amino acids)
vocab_tokens (array of strings, optional, size: 20) — Vocabulary tokens corresponding to logits indices
For masked prediction items:
logits (array of arrays of floats, shape: [num_masked_positions, vocab_size]) — Raw logits for the masked positions; vocab_size: 20 (the 20 standard amino acids)
sequence_tokens (array of strings, size: sequence_length) — Tokenized input sequence with mask tokens replaced by the model's predicted tokens
vocab_tokens (array of strings, size: 20) — Vocabulary tokens corresponding to logits indices
Example response:
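A sketch of the response shape for a masked prediction request with one masked position. All values are placeholders and vectors are truncated; only the key layout follows the field descriptions above.

```python
# Illustrative shape only; numeric values and token order are placeholders.
example_response = {
    "results": [
        {
            "logits": [[1.92, -0.33, 0.41, ...]],   # one row of 20 floats per masked position
            "sequence_tokens": ["M", "K", "T", ...],  # full sequence, mask replaced by prediction
            "vocab_tokens": ["A", "C", "D", "E", "F", "G", "H", "I", "K", "L",
                             "M", "N", "P", "Q", "R", "S", "T", "V", "W", "Y"],
        }
    ]
}
```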
Encode¶
Generate embeddings for input sequences
- POST /api/v3/esmc-600m/encode/¶
Encode endpoint for ESM C 600M.
- Request Headers:
Content-Type – application/json
Authorization – Token YOUR_API_KEY
Request
params (object, optional) — Configuration parameters:
repr_layers (array of integers, default: [-1]) — Indices of transformer layers for which embeddings are returned
include (array of strings, default: [“mean”]) — Output types to include; allowed values: “mean”, “per_token”, “logits”
items (array of objects, min: 1, max: 8) — Input sequences:
sequence (string, min length: 1, max length: 2048, required) — Protein sequence using extended amino acid alphabet plus “-” character
Example request:
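A minimal sketch of an encode call in Python, requesting mean and per-token embeddings from the last layer. The base URL and sequences are assumptions for illustration; the payload follows the schema above.

```python
import requests

URL = "https://biolm.ai/api/v3/esmc-600m/encode/"  # base URL assumed; use your deployment's host

headers = {
    "Content-Type": "application/json",
    "Authorization": "Token YOUR_API_KEY",
}

payload = {
    "params": {
        "repr_layers": [-1],               # -1 selects the last layer
        "include": ["mean", "per_token"],
    },
    "items": [
        {"sequence": "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"},  # arbitrary example sequences
        {"sequence": "GSHMSLFDFFKNKGSALTNLAKL"},
    ],
}

resp = requests.post(URL, headers=headers, json=payload, timeout=60)
resp.raise_for_status()
results = resp.json()["results"]
print(len(results[0]["embeddings"][0]["embedding"]))  # 1152 for ESM C 600M
```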
- Status Codes:
200 OK – Successful response
400 Bad Request – Invalid input
500 Internal Server Error – Internal server error
Response
results (array of objects) — One result per input item, in the order requested:
embeddings (array of objects, optional) — Mean embeddings per requested layer:
layer (int) — Layer index as returned for the requested representation (e.g., 35 for the last layer of the 600M model)
embedding (array of floats) — Mean embedding vector for the sequence; length depends on model size (1152 for the 600M model)
per_token_embeddings (array of objects, optional) — Per-token embeddings per requested layer:
layer (int) — Layer index as returned for the requested representation
embeddings (array of arrays of floats) — Embedding matrix with shape [sequence_length, embedding_size]; embedding_size depends on model size (1152 for the 600M model)
logits (array of arrays of floats, optional) — Logit scores with shape [sequence_length, vocab_size]; vocab_size = 33; values are unbounded floats
vocab_tokens (array of strings, optional, size: 33) — Vocabulary tokens corresponding to indices in logits
Example response:
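A sketch of the encode response shape for the request above. Values are placeholders and vectors are truncated; the layer index follows the last-layer example in the field descriptions.

```python
# Illustrative shape only; floats are placeholders, vectors truncated.
example_response = {
    "results": [
        {
            "embeddings": [
                {"layer": 35, "embedding": [0.012, -0.084, ...]}  # 1152 floats
            ],
            "per_token_embeddings": [
                {"layer": 35, "embeddings": [[0.10, -0.22, ...], ...]}  # [sequence_length, 1152]
            ],
        }
    ]
}
```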
Performance¶
ESM C 600M is hosted on NVIDIA T4 GPUs with 16 GB memory, providing GPU-accelerated inference for both encoding (embeddings/logits) and masked sequence prediction endpoints.
Relative to BioLM’s ESM-2 650M:
Delivers substantially higher predictive accuracy on structure-informed benchmarks (e.g., contact precision P@L on CASP15) while maintaining comparable latency at the same batch size (up to 8 sequences of length 2048).
Uses an optimized transformer architecture (Pre-LN, rotary embeddings, SwiGLU, bias-free layers) to reduce memory footprint and improve tokens-per-second throughput for long protein sequences.
Relative to BioLM’s larger ESM-2 3B and 15B models:
Matches or exceeds the predictive performance of ESM-2 3B and approaches ESM-2 15B on representation-learning tasks, with roughly 5× fewer parameters than 3B and much lower GPU memory requirements.
Offers a better accuracy–throughput trade-off for high-throughput embedding and masked residue scoring pipelines, enabling similar-quality downstream models at significantly lower compute cost.
Within the ESM C family on BioLM, the 600M variant typically achieves near-6B-level contact-map and representation quality for many applications while being substantially faster and cheaper to serve than the 6B model, making it the preferable choice when large-scale batching or low latency is a priority.
Applications¶
Protein embedding generation for downstream predictive modeling, enabling rapid screening of large sequence libraries for properties such as thermostability, expression, or catalytic efficiency; suitable as an input to custom machine learning models in enzyme engineering and general protein optimization, but not a direct predictor of structure or function on its own.
Unsupervised exploration and clustering of protein sequence space using embeddings from the encoder endpoint to identify novel families, scaffolds, or domains; useful for biotech teams mining metagenomic or proprietary sequence datasets for candidates with distinct sequence signatures, but not a substitute for experimental functional characterization.
Representation learning for protein variant effect modeling, where embeddings serve as features in supervised models to prioritize mutations that improve stability, activity, or developability; valuable for protein engineering workflows that combine in silico ranking with high-throughput screening, while still requiring labeled data for the specific target and assay of interest.
Embedding-based protein similarity search and retrieval, using encoder-generated vectors to build approximate nearest-neighbor indices and quickly locate sequences related to known functional benchmarks; effective for hit expansion and scaffold hopping in enzyme discovery pipelines, though it does not guarantee conservation of fine-grained active-site chemistry.
Using masked-token predictions from the predictor endpoint for mutation suggestion and plausibility checks (e.g., proposing amino acid substitutions at specific positions and scoring them via log probabilities); helpful for narrowing large mutational spaces to evolutionarily consistent variants, but not sufficient alone for precise quantitative predictions such as ΔΔG or kinetic parameters.
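As a sketch of the last workflow above, the raw logits returned for a masked position can be converted into per-amino-acid log probabilities with a standard softmax. The `rank_substitutions` helper below is hypothetical and operates on one entry of the predict endpoint's `results` array; the scoring math itself is standard.

```python
import math

def log_softmax(logits):
    """Numerically stable conversion of one row of raw logits to log probabilities."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def rank_substitutions(result):
    """Rank amino acids at each masked position by log probability.

    `result` is one entry of the predict endpoint's `results` array
    (see the example response above).
    """
    ranked = []
    for row in result["logits"]:  # one row of raw logits per masked position
        logp = log_softmax(row)
        pairs = sorted(zip(result["vocab_tokens"], logp),
                       key=lambda kv: kv[1], reverse=True)
        ranked.append(pairs)
    return ranked
```

Sorting by log probability yields an evolutionarily informed shortlist of substitutions; as noted above, these scores are not quantitative predictions of ΔΔG or kinetic parameters.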
Limitations¶
Maximum Sequence Length: Input sequences are limited to 2048 amino acids; longer sequences must be truncated or processed in segments.
Batch Size: The maximum allowed batch size per request is 8 sequences; larger datasets must be split into multiple requests (see the batching sketch after this list).
GPU Type: Inference is performed on T4 GPUs; performance may vary depending on sequence length, requested repr_layers, and whether include options like per_token or logits are used.
ESM C 600M is optimized for representation learning and masked token prediction; it does not perform full-sequence generative design or 3D structure prediction through this API.
The model is trained on natural protein sequences; sequences with extensive non-standard residues or ambiguous tokens may yield less meaningful embeddings or logits.
For antibody-specific structural or affinity optimization (e.g., detailed CDR modeling), specialized antibody design and structure models generally provide better performance than embeddings or masked predictions from ESM C 600M.
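Because requests are capped at 8 sequences of at most 2048 residues, clients typically chunk larger datasets before submission. A minimal sketch; the N-terminal truncation policy here is one simple choice, not a recommendation.

```python
def batched(sequences, batch_size=8, max_len=2048):
    """Yield API-sized batches, truncating each sequence to the length cap."""
    batch = []
    for seq in sequences:
        batch.append(seq[:max_len])  # simple N-terminal truncation
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

# Usage: submit each chunk as one request, e.g. to the encode endpoint
# shown earlier:
#   for chunk in batched(all_sequences):
#       payload = {"items": [{"sequence": s} for s in chunk]}
#       ...
```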
How We Use It¶
BioLM uses ESM C 600M as a core encoder in protein design workflows, generating rich sequence representations that drive downstream predictive models and in silico screening. These embeddings enable rapid filtering, ranking, and clustering of candidates for enzyme engineering, antibody optimization, and variant prioritization, and are often combined with masked-residue scoring from the predictor endpoint and external structure- or property-based models to support multi-round, lab-in-the-loop optimization.
Encodings from ESM C 600M serve as standardized inputs to custom property predictors (e.g., activity, stability, developability), improving model performance with minimal labeled data.
Masked prediction scores from ESM C 600M help identify tolerated and beneficial mutations, guiding focused mutagenesis libraries and reducing experimental search space.
References¶
ESM Team (2024). “ESM Cambrian: Revealing the mysteries of proteins with unsupervised learning.” EvolutionaryScale Website, December 4, 2024. https://www.evolutionaryscale.ai/blog/esm-cambrian
