E1 300M is a 300M-parameter retrieval-augmented protein encoder that generates sequence embeddings and zero-shot mutation scores from amino acid input, optionally conditioned on up to 50 unaligned homologous context sequences (each up to 2048 residues). It uses bidirectional Transformer layers with alternating intra-sequence and block-causal multi-sequence attention, trained on ~4T tokens. The GPU-accelerated service supports batched inference (up to 8 items) for variant ranking, fitness scoring, and embeddings for downstream structural and protein engineering models.
Predict¶
Perform masked amino acid prediction for positions marked with ‘?’ in the query sequence, optionally conditioned on homologous context sequences.
- POST /api/v3/e1-300m/predict/¶
Predict endpoint for E1 300M.
- Request Headers:
Content-Type – application/json
Authorization – Token YOUR_API_KEY
Request
params (object, optional) — Configuration parameters:
repr_layers (array of integers, default: [-1]) — Indices of encoder layers to return representations from
include (array of strings, default: ["mean"]) — Representation types to compute; allowed values: "mean", "per_token", "logits"
items (array of objects, min: 1, max: 8, required) — Input sequences:
sequence (string, min length: 1, max length: 2048, required) — Input protein sequence using extended amino acid alphabet (ACDEFGHIKLMNPQRSTVWYBXZUO)
context_sequences (array of strings, max items: 50, optional) — Context protein sequences, each with min length: 1, max length: 2048, using extended amino acid alphabet (ACDEFGHIKLMNPQRSTVWYBXZUO)
Example request:
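A minimal request sketch in Python using the requests library. The base URL (https://biolm.ai) is an assumption; substitute your deployment's host. The payload masks one position with '?' and conditions on two homologs, following the schema above.

```python
import requests

API_BASE = "https://biolm.ai"  # assumption: replace with your deployment's host
headers = {
    "Content-Type": "application/json",
    "Authorization": "Token YOUR_API_KEY",
}

# Mask position 3 of the query with '?' and condition on two homologous sequences.
payload = {
    "params": {"repr_layers": [-1], "include": ["logits"]},
    "items": [
        {
            "sequence": "MKT?LVLLLAVVA",
            "context_sequences": [
                "MKTALVLLLAVVA",
                "MKTSLVLLIAVVA",
            ],
        }
    ],
}

resp = requests.post(f"{API_BASE}/api/v3/e1-300m/predict/", json=payload, headers=headers)
resp.raise_for_status()
print(resp.json())
```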
- Status Codes:
200 OK – Successful response
400 Bad Request – Invalid input
500 Internal Server Error – Internal server error
Response
results (array of objects) — One result per input item, in the order requested:
logits (array of arrays of floats) — Per-position unnormalized scores over the amino acid vocabulary; shape [L, V], where L is the length of the input sequence and V is the vocabulary size
sequence_tokens (array of strings) — Tokens for each position in the input sequence; shape [L]; each element is a single-character amino acid or the '?' mask token
vocab_tokens (array of strings) — Vocabulary for the logits dimension; shape [V]; each element is a single-character amino acid token corresponding to the second dimension of logits
Example response:
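An illustrative response shape, shown as a Python literal with placeholder values for a short masked query; real responses cover the full [L, V] grid and vocabulary documented above.

```python
# Illustrative response for one item (placeholder values; dimensions truncated):
example_response = {
    "results": [
        {
            # [L, V] rows of unnormalized scores, one per query position
            "logits": [
                [2.1, -0.7, 0.3],
                [0.4, 1.9, -1.2],
                [-0.8, 0.6, 2.5],
                [1.1, 0.2, 0.9],  # scores at the '?' position
            ],
            # [L] query tokens; '?' marks the masked position
            "sequence_tokens": ["M", "K", "T", "?"],
            # [V] vocabulary order of the logits' second dimension (truncated here)
            "vocab_tokens": ["A", "C", "D"],
        }
    ]
}
```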
Encode¶
Encode protein sequences with optional homologous context using the E1 300M model, returning mean, per-token, and/or logits representations from the specified layers, as selected via include.
- POST /api/v3/e1-300m/encode/¶
Encode endpoint for E1 300M.
- Request Headers:
Content-Type – application/json
Authorization – Token YOUR_API_KEY
Request
params (object, optional) — Configuration parameters:
repr_layers (array of integers, default: [-1]) — Layer indices to include in the output representations
include (array of strings, default: ["mean"]) — Output components to return; allowed values: "mean", "per_token", "logits"
items (array of objects, min: 1, max: 8, required) — Input sequences:
sequence (string, min length: 1, max length: 2048, required) — Query amino acid sequence using extended alphabet (ACDEFGHIKLMNPQRSTVWYBXZUO)
context_sequences (array of strings, max items: 50, optional) — Optional homologous amino acid sequences using extended alphabet (ACDEFGHIKLMNPQRSTVWYBXZUO), each with min length: 1, max length: 2048
Example request:
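A minimal request sketch in Python, under the same assumptions as the Predict example (the https://biolm.ai host is an assumption), requesting pooled and per-token representations from the final layer.

```python
import requests

API_BASE = "https://biolm.ai"  # assumption: replace with your deployment's host
headers = {
    "Content-Type": "application/json",
    "Authorization": "Token YOUR_API_KEY",
}

payload = {
    "params": {"repr_layers": [-1], "include": ["mean", "per_token"]},
    "items": [
        {
            "sequence": "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ",
            "context_sequences": ["MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQA"],
        }
    ],
}

resp = requests.post(f"{API_BASE}/api/v3/e1-300m/encode/", json=payload, headers=headers)
resp.raise_for_status()
result = resp.json()["results"][0]
mean_vec = result["embeddings"][0]["embedding"]  # pooled vector, one entry per repr_layer
print(len(mean_vec))  # model hidden size
```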
- Status Codes:
200 OK – Successful response
400 Bad Request – Invalid input
500 Internal Server Error – Internal server error
Response
results (array of objects) — One result per input item, in the order requested:
embeddings (array of objects, optional) — Layer-level pooled embeddings, present when "mean" is included in include:
layer (int) — Layer index matching values in repr_layers
embedding (array of floats) — Pooled embedding vector for the query sequence; length equals the model hidden size (model-dependent)
per_token_embeddings (array of objects, optional) — Layer-level per-token embeddings, present when "per_token" is included in include:
layer (int) — Layer index matching values in repr_layers
embeddings (array of arrays of floats) — Per-token embedding vectors for the query sequence; shape [L, H], where L is the query sequence length (excluding context) and H is the model hidden size (model-dependent)
logits (array of arrays of floats, optional) — Unnormalized scores over the model vocabulary for each token position in the query sequence; shape [L, V], where L is the query sequence length (excluding context) and V is len(vocab_tokens); values are real-valued and unbounded
vocab_tokens (array of strings, optional) — Vocabulary tokens corresponding to the last dimension of logits; length V matches the inner dimension of logits
context_sequence_count (int, optional) — Number of context sequences used during encoding (0–50)
Example response:
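An illustrative response shape for include=["mean", "per_token"], shown as a Python literal with placeholder values; real vectors have hidden-size H dimensions.

```python
# Illustrative response for one item (placeholder values; dimensions truncated):
example_response = {
    "results": [
        {
            "embeddings": [
                {"layer": -1, "embedding": [0.12, -0.85, 0.33]}  # length H
            ],
            "per_token_embeddings": [
                {
                    "layer": -1,
                    "embeddings": [  # shape [L, H], one row per query residue
                        [0.05, -0.41, 0.27],
                        [0.18, 0.02, -0.66],
                    ],
                }
            ],
            "context_sequence_count": 1,
        }
    ]
}
```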
Performance¶
Retrieval-augmented vs single-sequence accuracy
- In sequence-only mode, E1 300M matches or slightly exceeds ESM-2 650M and ESM C 300M on ProteinGym zero-shot fitness prediction, while using fewer parameters than ESM-2 3B.
- With homologous context sequences provided via context_sequences, E1 300M improves ProteinGym average Spearman from ~0.42 to ~0.48 and NDCG@10 from ~0.75 to ~0.79, surpassing PoET and MSA Pairformer at comparable or larger scales in the same retrieval-augmented setting.
- Relative to other E1 sizes, E1 300M offers a moderate but consistent gain over E1 150M (~0.01–0.02 Spearman/NDCG) and is typically within a few thousandths of E1 600M, giving a favorable trade-off between accuracy and compute within the E1 family.

Structural proxy performance (contact maps)
- For unsupervised long-range contact prediction on CAMEO and CASP15 using architecture-agnostic Categorical Jacobians, E1 300M in single-sequence mode outperforms ESM-2 650M and ESM C 300M, with absolute precision@L gains of ~0.05–0.07.
- When homologous context sequences are supplied, E1 300M further increases contact precision while remaining substantially cheaper than running full 3D structure predictors such as AlphaFold2 or ESMFold, making it practical for large-scale screening.
- For workflows where full atomic models are unnecessary, E1 300M embeddings and contact proxies often yield better throughput–accuracy trade-offs than structure generators (AlphaFold2, ESMFold, AntiFold, Chai-1) for ranking, clustering, and filtering.

Comparison to other BioLM sequence encoders
- Versus ESM-2 150M / 650M, E1 300M delivers higher zero-shot fitness and contact-map performance at a similar or smaller parameter count than ESM-2 650M, and is generally more latency-efficient than ESM-2 3B for mutation scoring and zero-shot design.
- Versus ESM C 300M / 600M, E1 300M is competitive or superior on ProteinGym and contact benchmarks and gains additional accuracy when context_sequences are provided, especially for low- to medium-depth families where ESM C cannot exploit retrieval.
- Versus MSA Transformer and MSA Pairformer, E1 300M uses unaligned context sequences with block-causal multi-sequence attention, avoiding MSA construction; combined with reuse of retrieved homologs across many variants, this typically yields faster end-to-end pipelines.

Deployment and scaling characteristics
- At 300M parameters, the model maintains strong accuracy while leaving headroom for batched and retrieval-augmented inference on modern data center GPUs, yielding higher throughput than larger encoders (E1 600M, ESM-2 3B) at only slightly reduced accuracy.
- Compared with autoregressive generative models (e.g., ProGen2, Evo-series) used for full de novo design, E1 300M behaves as a scoring and embedding engine, processing many more sequences per GPU-second because it does not generate tokens sequentially.
- In iterative optimization campaigns (e.g., enzyme or antibody maturation), accuracy can be improved by adding or updating context_sequences without retraining or fine-tuning, simplifying production deployment compared to fine-tuned models that must be revalidated per target.
Applications¶
Zero-shot fitness scoring and variant ranking for protein engineering campaigns, using E1 300M log-likelihoods (via predict_log_prob) to prioritize mutations with higher predicted functional retention or improvement before wet-lab screening (see the scoring sketch after this list); useful for industrial enzymes (e.g., stability, solvent tolerance, temperature) and other proteins where large mutagenesis libraries are costly. Works best for local substitutions around a known wild-type rather than de novo sequence spaces far from natural proteins.

Retrieval-augmented lead optimization within protein families by supplying homologous context sequences in encode or predict requests, allowing E1 300M to better respect family-specific co-evolution when ranking hits from directed evolution or display selections, so teams can push leads toward improved potency, specificity, or stability with fewer experimental cycles. Most effective when a moderate-to-deep set of homologs is available; less informative for highly novel or orphan families with sparse sequence data.

Structure-aware filtering of designed proteins by deriving unsupervised long-range contact signals from E1 300M per-token embeddings (via encode with per_token outputs) and downstream Categorical Jacobian–style analyses to reject designs unlikely to form consistent 3D contact patterns, focusing costly cryo-EM, NMR, crystallography, or SAXS on candidates with more plausible folds. Suitable as a fast pre-filter, but not a substitute for dedicated structure predictors when atomic accuracy is required.

Developability risk assessment for therapeutic and industrial proteins by extracting E1 300M embeddings and log-probability scores along the sequence to flag positions where proposed mutations are strongly disfavored relative to evolutionary context, informing which variants proceed into expression, formulation, or stability testing. Particularly helpful when integrated with in-house assays in an active-learning loop, but not a standalone predictor of issues like immunogenicity or manufacturability, which still require specialized models and empirical data.

Sequence space navigation and library design within known protein families by using E1 300M to score or embed large in silico variant sets (from generative models or combinatorial designs) and retain only sequences that remain in a high-likelihood, evolutionarily consistent manifold conditioned on homologs, thereby shrinking library size while maintaining functional diversity. Best suited to focused libraries around an existing scaffold rather than very broad exploratory searches that intentionally depart far from observed natural sequences.
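As a concrete illustration of the first application, a hedged Python sketch of single-position mutation scoring via the predict endpoint: mask the site, then compare logits for the mutant and wild-type residues. The base URL, helper name, and log-ratio convention are assumptions, not fixed by this page.

```python
import requests

API_BASE = "https://biolm.ai"  # assumption: deployment-specific host
HEADERS = {"Content-Type": "application/json",
           "Authorization": "Token YOUR_API_KEY"}

def score_substitution(seq: str, pos: int, wt: str, mut: str) -> float:
    """Hypothetical helper: log P(mut) - log P(wt) at a masked position.

    The softmax normalizer cancels in the log-ratio, so the raw logit
    difference at the masked site is the score.
    """
    masked = seq[:pos] + "?" + seq[pos + 1:]
    payload = {"params": {"include": ["logits"]},
               "items": [{"sequence": masked}]}
    resp = requests.post(f"{API_BASE}/api/v3/e1-300m/predict/",
                         json=payload, headers=HEADERS)
    resp.raise_for_status()
    result = resp.json()["results"][0]
    row = result["logits"][pos]       # [V] scores at the masked position
    vocab = result["vocab_tokens"]
    return row[vocab.index(mut)] - row[vocab.index(wt)]

# Positive scores suggest the model favors the mutant residue at that site.
print(score_substitution("MKTAYIAKQRQISFVKSHFSRQ", pos=3, wt="A", mut="V"))
```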
Limitations¶
Maximum sequence length and batch size. Each query sequence and each context_sequences entry is limited to 2048 amino acids (E1Params.max_sequence_len). Requests to /encode, /predict, and /predict_log_prob can include at most 8 items per call (E1Params.batch_size). Very long proteins must be truncated or tiled (see the sketch after this list), which can break biological context and affect predictions.

Context sequence limits and retrieval quality. Retrieval-augmented mode is constrained to at most 50 context sequences per item (E1Params.max_context_sequences), each obeying the same 2048-residue limit. The API does not perform homolog search: if context_sequences are sparse, non-homologous, or heavily biased, performance gains may be minimal or worse than single-sequence mode.

Alphabet and masking constraints. E1EncodeRequestItem.sequence and its context_sequences accept the extended amino acid alphabet (ACDEFGHIKLMNPQRSTVWY + BXZUO). E1PredictRequestItem.sequence must include at least one ? mask, and only this query sequence may contain ?; its context_sequences must not contain ? and must use only the extended alphabet. E1PredictLogProbRequestItem.sequence and its context_sequences are stricter and must use only canonical amino acids via validate_aa_unambiguous. Sequences with unsupported characters or invalid masking are rejected.

Embedding and output options trade-offs. E1EncodeRequestParams.include controls which representations are returned: "mean" yields per-sequence embeddings, "per_token" yields per-residue embeddings, and "logits" yields raw token scores with vocab_tokens. Requesting multiple options increases response size and latency; "per_token" outputs for long sequences or large batches can be expensive to transmit and store. E1 is an encoder-only model exposed via the encode and predict endpoints and does not perform autoregressive sequence generation.

Scientific and task limitations. E1 300M is trained as a masked language model on natural protein sequences and is most reliable for fitness prediction, variant ranking, and structural proxy tasks such as contact-map–like signals. It is not a full 3D structure predictor (no AlphaFold2-like atomic models), not a general-purpose diffusion or causal generative designer, and not tuned for non-protein polymers. Performance can degrade for highly novel folds, chimeric constructs, or synthetic libraries far from natural evolution, especially when informative homologs are unavailable.
When another model may be preferable. For tasks requiring atomic-resolution structures, explicit antibody/CDR3 structure modeling, long-context conditional generation, or embeddings jointly trained with experimental structure/functional labels, specialized models (e.g., fold predictors, antibody-focused encoders, diffusion or causal LMs) are often more appropriate. E1 300M is a good default for encoding and scoring natural-like proteins, but for large generative design campaigns, rapid fold-ranking, or antibody-centric workflows, consider combining E1 with or replacing it by more task-specific models.
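A minimal client-side sketch of the tiling and batching the limits above imply: split over-length sequences into windows of at most 2048 residues and submit items in groups of at most 8. The overlap size and helper names are assumptions; as noted above, tiling can split biologically relevant context across windows.

```python
MAX_LEN = 2048   # E1Params.max_sequence_len
MAX_BATCH = 8    # E1Params.batch_size

def tile(seq: str, window: int = MAX_LEN, overlap: int = 256):
    """Split an over-length sequence into overlapping windows (overlap is an assumption)."""
    if len(seq) <= window:
        return [seq]
    step = window - overlap
    return [seq[i:i + window] for i in range(0, len(seq) - overlap, step)]

def batches(items, size: int = MAX_BATCH):
    """Yield request-sized groups of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

# Tile a 5000-residue sequence, then group the windows into valid requests.
items = [{"sequence": s} for s in tile("M" * 5000)]
for group in batches(items):
    payload = {"items": group}  # POST to /api/v3/e1-300m/encode/ as shown earlier
    print(len(group), "items; first window length:", len(group[0]["sequence"]))
```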
How We Use It¶
E1 300M enables reliable zero-shot fitness estimation and structurally informed encodings across large protein portfolios, which we plug into end‑to‑end workflows for enzyme engineering, antibody optimization, and iterative sequence maturation. Its retrieval‑augmented embeddings, accessed through standardized, scalable APIs, integrate with structure predictors, generative sequence models, and biophysical property predictors (e.g., stability, charge, size) so data science and wet‑lab teams can prioritize variants, target libraries around promising regions of sequence space, and systematically incorporate assay readouts into downstream ML models such as family‑specific regressors, active‑learning loops, and multi‑objective rankers.
Used for zero‑shot ranking and triage of large variant libraries before synthesis, then refined with assay data to guide subsequent design rounds.
Combined with structural and biophysical predictors to build multi‑parameter filters (activity, developability, manufacturability) for antibody and enzyme candidates.
References¶
Jain, S., Beazer, J., Ruffolo, J. A., Bhatnagar, A., & Madani, A. (2025). E1: Retrieval-Augmented Protein Encoder Models. Profluent Bio Technical Report / Preprint. Available at: https://github.com/Profluent-AI/E1
