E1 150M is a retrieval-augmented protein encoder that produces masked language model logits and embeddings for individual residues and whole sequences. The API supports GPU-accelerated, batched inference (up to 8 items, 2,048 residues each, with up to 50 unaligned homologous context sequences) via encode and predict endpoints. Typical uses include zero-shot fitness scoring, variant ranking from masked predictions, and embedding extraction for downstream structural and protein engineering workflows.
Predict¶
Predict masked amino acids (‘?’) in query sequences, optionally conditioned on homologous context sequences.
- POST /api/v3/e1-150m/predict/¶
Predict endpoint for E1 150M.
- Request Headers:
Content-Type – application/json
Authorization – Token YOUR_API_KEY
Request
params (object, optional) — Configuration parameters:
repr_layers (array of integers, default: [-1]) — Layer indices for which to return representations
include (array of strings, allowed: [“mean”, “per_token”, “logits”], default: [“mean”]) — Representation types to include in the response
items (array of objects, min: 1, max: 8, required) — Input items:
sequence (string, min length: 1, max length: 2048, required) — Protein sequence using extended amino acid alphabet (ACDEFGHIKLMNPQRSTVWYBXZUO)
context_sequences (array of strings, max items: 50, optional) — Context sequences using extended amino acid alphabet (ACDEFGHIKLMNPQRSTVWYBXZUO), each with min length: 1, max length: 2048
Example request:
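(A minimal sketch in Python using the requests library; the https://biolm.ai host and the placeholder sequences are assumptions, not values taken from this page.)

```python
import requests

API_KEY = "YOUR_API_KEY"
URL = "https://biolm.ai/api/v3/e1-150m/predict/"  # host assumed; path per this page

payload = {
    "items": [
        {
            # '?' marks the masked positions to be predicted
            "sequence": "MKT?YIAKQRQISFVKSHFSRQLEERLGLIEVQ?PILSRVGD",
            # optional unaligned homologs used as retrieval context
            "context_sequences": [
                "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGD",
                "MKSAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGN",
            ],
        }
    ],
}

resp = requests.post(
    URL,
    json=payload,
    headers={"Authorization": f"Token {API_KEY}", "Content-Type": "application/json"},
)
resp.raise_for_status()
result = resp.json()
```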
- Status Codes:
200 OK – Successful response
400 Bad Request – Invalid input
500 Internal Server Error – Internal server error
Response
results (array of objects) — One result per input item, in the order requested:
logits (array of arrays of floats, shape: [L, V]) — Unnormalized scores for each sequence position and vocabulary token, where L is the length of sequence_tokens and V is the vocabulary size
sequence_tokens (array of strings, length: L) — Amino acid tokens for each position in the input sequence
vocab_tokens (array of strings, length: V) — Vocabulary tokens defining the column order of logits
Example response:
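(Illustrative shape only; the values below are placeholders, not real model output.)

```python
# Sketch of the predict response structure (truncated)
{
    "results": [
        {
            "sequence_tokens": ["M", "K", "T", "?", "Y", "..."],  # length L
            "vocab_tokens": ["A", "C", "D", "E", "..."],          # length V
            "logits": [                                           # shape [L, V]
                [1.23, -0.45, 0.07, ...],
                ...,
            ],
        }
    ]
}
```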
Encode¶
Encode protein sequences with and without retrieval-augmented context, returning mean, per-token, and logits representations from selected layers.
- POST /api/v3/e1-150m/encode/¶
Encode endpoint for E1 150M.
- Request Headers:
Content-Type – application/json
Authorization – Token YOUR_API_KEY
Request
params (object, optional) — Configuration parameters:
repr_layers (array of integers, default: [-1]) — Indices of encoder layers to return in the response
include (array of strings, default: [“mean”]) — Representation types to include; allowed values: “mean”, “per_token”, “logits”
items (array of objects, min: 1, max: 8) — Input sequences to encode:
sequence (string, min length: 1, max length: 2048, required) — Amino acid sequence using extended alphabet (ACDEFGHIKLMNPQRSTVWYBXZUO)
context_sequences (array of strings, max items: 50, optional; each string min length: 1, max length: 2048) — Optional amino acid context sequences using extended alphabet (ACDEFGHIKLMNPQRSTVWYBXZUO)
Example request:
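(A minimal sketch in Python using the requests library; the https://biolm.ai host and the placeholder sequences are assumptions, not values taken from this page.)

```python
import requests

API_KEY = "YOUR_API_KEY"
URL = "https://biolm.ai/api/v3/e1-150m/encode/"  # host assumed; path per this page

payload = {
    "params": {
        "repr_layers": [-1],               # final encoder layer
        "include": ["mean", "per_token"],  # pooled and per-token embeddings
    },
    "items": [
        {
            "sequence": "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGD",
            "context_sequences": [
                "MKSAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGN",
            ],
        }
    ],
}

resp = requests.post(
    URL,
    json=payload,
    headers={"Authorization": f"Token {API_KEY}", "Content-Type": "application/json"},
)
resp.raise_for_status()
result = resp.json()
```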
- Status Codes:
200 OK – Successful response
400 Bad Request – Invalid input
500 Internal Server Error – Internal server error
Response
results (array of objects) — One result per input item, in the order requested:
embeddings (array of objects, optional) — Layer-level pooled embeddings
layer (int) — Layer index used to generate the embedding (e.g., -1 for final layer)
embedding (array of floats) — Single vector embedding for the concatenated input (query + context); length equals the model hidden size for the selected E1 model (150m/300m/600m)
per_token_embeddings (array of objects, optional) — Layer-level per-token embeddings
layer (int) — Layer index used to generate the per-token embeddings (e.g., -1 for final layer)
embeddings (array of arrays of floats) — Per-token vectors; outer length equals the number of tokens in the concatenated input (query + context), inner length equals the model hidden size for the selected E1 model (150m/300m/600m)
logits (array of arrays of floats, optional) — Unnormalized scores over the model vocabulary; outer length equals the number of tokens in the concatenated input (query + context), inner length equals len(vocab_tokens); values are real-valued and unbounded
vocab_tokens (array of strings, optional) — Vocabulary tokens corresponding to the second dimension of logits; order matches the inner dimension of logits
context_sequence_count (int, optional) — Number of context sequences used for the item; range: 0–50
Example response:
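(Illustrative shape only; the values below are placeholders, not real model output.)

```python
# Sketch of the encode response structure (truncated)
{
    "results": [
        {
            "embeddings": [
                {"layer": -1, "embedding": [0.12, -0.08, ...]}  # length = hidden size
            ],
            "per_token_embeddings": [
                {"layer": -1, "embeddings": [[0.05, 0.31, ...], ...]}  # one vector per token
            ],
            "logits": [[1.9, -0.7, ...], ...],      # [tokens, len(vocab_tokens)]
            "vocab_tokens": ["A", "C", "D", "E", "..."],
            "context_sequence_count": 1,
        }
    ]
}
```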
Performance¶
Model class and deployment
- E1 150M is a 150M-parameter encoder-only Transformer, deployed on recent-generation NVIDIA data center GPUs (A100 / H100 class) with mixed-precision inference (FP16 / BF16) and fused attention kernels optimized for retrieval-augmented workloads.
- BioLM’s serving stack uses batched execution across up to 8 items per request, yielding predictable, roughly linear scaling with total token count (summed over query and context sequences) for the encode and predict endpoints.

Relative latency and throughput within the E1 family and vs. non-retrieval encoders
- For common API workloads (encoding or log-probability-style scoring via encode, masked prediction via predict, with modest context), E1 150M typically runs about 1.6–1.8× faster than E1 300M and about 2.3–2.7× faster than E1 600M at similar total token counts.
- On sequence-only queries, per-token latency is modestly higher than non-retrieval models of similar size (e.g., ESM-2 150M) because of the block-causal machinery, but wall-clock time remains within roughly 20–30% while providing retrieval support when context is present.

Predictive accuracy vs. other BioLM encoders (ProteinGym v1.3, substitution assays)
- In sequence-only use, E1 150M reaches an average Spearman of 0.401 and NDCG@10 of 0.744, slightly exceeding ESM-2 150M (0.387 Spearman) and comparable to ESM C 300M (0.406 Spearman) despite having fewer parameters.
- With homologous context provided through the API, E1 150M reaches 0.473 Spearman and 0.785 NDCG@10, modestly surpassing PoET (0.470 / 0.784) and landing within about 0.004–0.006 Spearman of the larger E1 300M / 600M variants, making it a strong default for large variant panels.

Structural proxy performance and deployment scalability
- On unsupervised long-range contact prediction (CAMEO / CASP15, Precision@L), E1 150M in sequence-only mode achieves 0.466 / 0.387 vs. 0.348 / 0.272 for ESM-2 150M, and even exceeds ESM-2 650M (0.423 / 0.342), indicating more structure-aware embeddings at a lower parameter count.
- With homologous context, E1 150M reaches 0.510 (CAMEO) and 0.406 (CASP15), matching or exceeding MSA Pairformer on CAMEO (0.489) while remaining competitive on CASP15 (0.428), and does so without MSA construction or row/column attention, which improves scaling to many context sequences and allows denser packing of concurrent API requests per GPU than larger E1 models or autoregressive generators.
Applications¶
Zero-shot variant impact scoring for protein engineering campaigns, using E1 150M as a drop-in fitness predictor on wild-type backbones to prioritize single-site and low-order mutants before wet-lab screening, reducing library sizes and assay costs; particularly useful for enzyme activity or stability optimization and manufacturability improvements when only sequence data are available and no labeled assay data exist, but less suitable when high-accuracy task-specific supervised models trained on rich experimental datasets are already deployed
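One common pattern (a sketch, not the only valid scoring scheme): mask each position of interest, call the predict endpoint, and compare the log-softmax scores of the mutant and wild-type residues at that position. The helper below assumes a result item shaped like the predict response shown earlier.

```python
import math

def position_log_probs(result_item, position):
    """Log-softmax over the vocabulary at one masked position."""
    logits = result_item["logits"][position]
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return {tok: x - log_z for tok, x in zip(result_item["vocab_tokens"], logits)}

def zero_shot_score(result_item, position, wt_aa, mut_aa):
    """log P(mutant) - log P(wild-type) at a masked position; higher = more favorable."""
    lp = position_log_probs(result_item, position)
    return lp[mut_aa] - lp[wt_aa]

# e.g., score an A->K substitution at (0-based) position 3 of the first item:
# score = zero_shot_score(result["results"][0], position=3, wt_aa="A", mut_aa="K")
```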
Retrieval-augmented homolog conditioning for difficult protein families where multiple sequence alignments are shallow or noisy, using E1 150M with unaligned homologs (passed as
context_sequences) to better capture family-specific constraints and coevolutionary patterns, improving ranking of functional versus non-functional variants in early-stage enzyme discovery or protein replacement therapy projects; valuable when MSAs are expensive to compute at scale, though gains are limited if few or no meaningful homologs are available in public or proprietary databases

Embedding-based structure-aware analysis for downstream modeling, where E1 150M encoder outputs (mean or per-token embeddings from selected layers) are used as features for secondary tasks such as unsupervised contact map estimation, fold or domain classification, or as inputs to in-house docking/structure refinement pipelines, enabling teams to incorporate evolution-informed representations into structure-based design workflows without training their own large protein language models; most informative for globular, well-structured proteins and less so for highly disordered or non-natural sequences far from the training distribution
Data-efficient fitness modeling in directed evolution and high-throughput screening programs by combining E1 150M embeddings with shallow supervised models (for example, gradient-boosted trees or small neural networks) trained on limited assay data, improving generalization and hit enrichment for subsequent design rounds while avoiding the cost and complexity of fine-tuning large protein language models; particularly useful for industrial enzyme optimization under specific process conditions, though overall performance still depends on the quality, noise level, and sequence diversity of the experimental training set
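A minimal sketch of this pattern (not an official client): mean-pooled E1 150M embeddings as features for a small scikit-learn regressor. The https://biolm.ai host is assumed, and train_seqs / train_fitness are placeholders for your own assay data.

```python
import numpy as np
import requests
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

API_KEY = "YOUR_API_KEY"
URL = "https://biolm.ai/api/v3/e1-150m/encode/"  # host assumed

def mean_embeddings(sequences, batch_size=8):
    """Encode sequences in batches of <= 8 items and collect mean embeddings."""
    vectors = []
    for i in range(0, len(sequences), batch_size):
        payload = {
            "params": {"repr_layers": [-1], "include": ["mean"]},
            "items": [{"sequence": s} for s in sequences[i : i + batch_size]],
        }
        resp = requests.post(URL, json=payload,
                             headers={"Authorization": f"Token {API_KEY}"})
        resp.raise_for_status()
        for r in resp.json()["results"]:
            vectors.append(r["embeddings"][0]["embedding"])
    return np.array(vectors)

# Usage with your own assay data:
# X = mean_embeddings(train_seqs)    # sequences with measured fitness
# y = np.array(train_fitness)        # matching experimental readouts
# model = GradientBoostingRegressor()
# print(cross_val_score(model, X, y, cv=5, scoring="r2"))
```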
Retrieval-guided exploration of protein design spaces for de novo or semi-rational design, where candidate sequences generated by in-house generative models or combinatorial libraries are filtered using E1 150M log-probability scores or masked-prediction logits, optionally conditioned on homologous
context_sequences, to prioritize designs whose implied fitness and structural plausibility remain consistent with known family constraints; less appropriate for completely novel folds or highly artificial scaffolds with no meaningful evolutionary neighbors in available sequence databases
Limitations¶
Maximum sequence length and batch size. Each query sequence and each context_sequences entry is limited to E1Params.max_sequence_len (= 2048) characters. Requests must contain between 1 and E1Params.batch_size (= 8) items. Very long proteins must be truncated or split, and large libraries need to be batched client-side.
Context usage and type constraints. Retrieval-augmented mode is optional and limited to at most E1Params.max_context_sequences (= 50) context_sequences per item. For E1EncodeRequest and E1PredictRequest, all sequences (query and context) must use the extended amino acid alphabet (ACDEFGHIKLMNPQRSTVWY + BXZUO), with E1PredictRequest additionally allowing '?' only in the query sequence as a mask token. For E1PredictLogProbRequest, both sequence and context_sequences must use only the 20 canonical amino acids; non-canonical or masked residues are rejected. Context sequences that are too similar, non-homologous, or noisy can degrade retrieval-augmented performance instead of improving it.
Embedding and logits outputs. E1EncodeRequestParams.include controls which representations are returned: "mean" yields sequence-level embeddings, "per_token" yields position-wise embeddings, and "logits" returns raw per-token logits with accompanying vocab_tokens. Only layers listed in repr_layers are computed; requesting many layers and per-token outputs increases memory and latency. E1 is an encoder-only model and does not generate new sequences autoregressively; it is best suited for scoring, ranking, and representation, not de novo sequence generation.
Fitness prediction scope and biases. While E1 150M achieves strong zero-shot performance on benchmarks like ProteinGym in both single-sequence and retrieval-augmented modes, scores are not guaranteed to correlate with experimental fitness for every protein family, selection pressure, or assay type (e.g. highly synthetic designs far from natural space, exotic environments, or complex multi-protein phenotypes). Model behavior reflects training-data biases (taxonomic and functional) and is most reliable on natural-like proteins with meaningful evolutionary context.
Structure-related limitations. E1 supports structure-related analysis only indirectly via embeddings and logits (e.g. contact-like signals); it does not output full 3D structures and is not a replacement for structure predictors like AlphaFold2 or ESMFold. Contact-style inferences are most informative for typical single-chain proteins with reasonable homolog depth and may degrade for very shallow MSAs, disordered regions, or unusual architectures. For final structural ranking or atomic-level design decisions, use dedicated structure models and treat E1 outputs as a fast proxy signal.
When E1 is not the best choice. E1 150M is optimized for scalable encoding and zero-shot scoring, not for: (1) generative sequence design (use causal or diffusion models); (2) antibody- or nanobody-specific structure modeling (specialized antibody structure models perform better on CDRs); (3) very long sequences beyond 2048 residues; or (4) downstream tasks that require joint modeling of proteins with ligands, RNAs, or complexes. In these cases, E1 embeddings can still be useful as features, but other BioLM models or pipelines are typically more appropriate.
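To avoid 400 errors from the limits above, inputs can be checked client-side before submission. A minimal sketch under the constraints described in this section (the helper name is illustrative, not part of the API):

```python
# Client-side validation against the documented request limits (a sketch).
EXTENDED_AA = set("ACDEFGHIKLMNPQRSTVWYBXZUO")
MAX_LEN, MAX_ITEMS, MAX_CONTEXT = 2048, 8, 50

def validate_items(items, allow_mask=False):
    """Raise ValueError if `items` would violate the documented request limits.

    allow_mask=True permits '?' in the query sequence (predict endpoint only).
    """
    if not 1 <= len(items) <= MAX_ITEMS:
        raise ValueError(f"requests must contain 1-{MAX_ITEMS} items")
    query_alphabet = EXTENDED_AA | ({"?"} if allow_mask else set())
    for item in items:
        seq = item["sequence"]
        if not 1 <= len(seq) <= MAX_LEN or not set(seq) <= query_alphabet:
            raise ValueError(f"invalid query sequence: {seq[:20]}...")
        contexts = item.get("context_sequences", [])
        if len(contexts) > MAX_CONTEXT:
            raise ValueError(f"at most {MAX_CONTEXT} context sequences per item")
        for ctx in contexts:
            if not 1 <= len(ctx) <= MAX_LEN or not set(ctx) <= EXTENDED_AA:
                raise ValueError(f"invalid context sequence: {ctx[:20]}...")

# validate_items(payload["items"], allow_mask=True)  # before calling /predict/
```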
How We Use It¶
E1 150M enables rapid retrieval-augmented embeddings and fitness-like scores for protein sequences that plug directly into end-to-end protein engineering campaigns, from virtual mutational scans and variant prioritization to structure-aware lead optimization. In practice, teams use E1 150M as a standardized scoring and representation layer inside scalable pipelines: encoder outputs feed downstream structure models and 3D metrics, zero-shot mutation scoring guides which libraries are synthesized, and sequence-level features are combined with charge, stability, and other biophysical predictors to iteratively refine enzyme and antibody designs. Because E1 150M is available through a stable, scalable API, data scientists, ML engineers, and wet-lab scientists can share a common scoring backbone across internal tools and multi-round optimization workflows, reducing model-selection overhead and accelerating time from idea to validated molecule.
Integrates with structure-based tools (e.g., contact-map–driven or AlphaFold-style analyses) and other sequence encoders to form multi-objective ranking pipelines for stability, activity, and developability.
Supports lab-in-the-loop design cycles by providing consistent embeddings and mutation scores across successive rounds of library design, synthesis, and experimental readout, enabling robust finetuning and portfolio-scale comparability.
References¶
Jain, S., Beazer, J., Ruffolo, J. A., Bhatnagar, A., & Madani, A. (2025). E1: Retrieval-Augmented Protein Encoder Models. Preprint / Technical Report. Available at https://github.com/Profluent-AI/E1.
