AbLang-2 is an antibody-specific language model optimized to reduce germline bias and better model non-germline residues relevant for binding and developability. Trained on 35.6M unpaired and 1.26M paired VH/VL sequences from OAS using a modified masked language modeling objective with focal loss, it supports paired heavy/light-chain inputs up to 1024 residues per chain and batched processing of up to 32 antibodies per request. The API exposes sequence- and residue-level encodings, per-position likelihoods, and sequence restoration for masked or missing residues.
Predict
Predict per-position likelihoods for the input sequences
- POST /api/v3/ablang2/predict/
Predict endpoint for AbLang-2.
- Request Headers:
Content-Type – application/json
Authorization – Token YOUR_API_KEY
Request
params (object, optional) — Configuration parameters:
include (string, default: “likelihood”) — Output type to include; must be “likelihood”
items (array of objects, min: 1, max: 32) — Input antibody sequence pairs:
heavy (string, min length: 1, max length: 1024, required) — Heavy chain amino acid sequence using extended amino acid alphabet
light (string, min length: 1, max length: 1024, required) — Light chain amino acid sequence using extended amino acid alphabet
Example request:
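A minimal sketch of building this request with the Python standard library. The base URL, the API key value, and the short sequence fragments are placeholders chosen for illustration and do not come from this documentation; the request is constructed but not sent.

```python
import json
import urllib.request

# Assumption: the API host is not stated in this section, so this is a placeholder.
BASE_URL = "https://biolm.example"


def build_predict_request(pairs, api_key, base_url=BASE_URL):
    """Build (but do not send) a POST request for /api/v3/ablang2/predict/."""
    if not 1 <= len(pairs) <= 32:
        raise ValueError("items must contain between 1 and 32 antibody pairs")
    payload = {
        "params": {"include": "likelihood"},
        "items": [{"heavy": heavy, "light": light} for heavy, light in pairs],
    }
    return urllib.request.Request(
        url=f"{base_url}/api/v3/ablang2/predict/",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Token {api_key}",
        },
        method="POST",
    )


# Illustrative fragments only; real inputs would be full VH/VL sequences.
request = build_predict_request(
    [("EVQLVESGGGLVQPGGSLRLSCAAS", "DIQMTQSPSSLSASVGDRVTITC")],
    api_key="YOUR_API_KEY",
)
print(request.full_url)
print(request.data.decode("utf-8"))

# To send it: urllib.request.urlopen(request), then json.load() on the response.
```

Keeping request construction separate from sending makes the payload easy to inspect before any API call is made.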
- Status Codes:
200 OK – Successful response
400 Bad Request – Invalid input
500 Internal Server Error – Internal server error
Response
results (array of objects) — One result per input item, in the order requested:
likelihood (array of arrays of floats, shape: [num_sequence_tokens, vocab_size]) — Per-token logit scores over the model vocabulary
sequence_tokens (array of strings, length: num_sequence_tokens) — Tokenized heavy and light input sequences, including special and separator tokens
vocab_tokens (array of strings, length: vocab_size) — Vocabulary tokens corresponding to columns in likelihood
Example response:
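A sketch of how the response fields fit together, assuming the scores are unnormalized logits (as described above) that can be turned into probabilities with a softmax over each row. The toy result below uses made-up values and a three-token vocabulary purely to show the shapes; a real response has one row per entry in sequence_tokens and one column per entry in vocab_tokens.

```python
import math


def softmax(row):
    """Convert one row of logits into probabilities that sum to 1."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def top_tokens(result, k=3):
    """For each position, return the k most probable vocabulary tokens."""
    vocab = result["vocab_tokens"]
    ranked_positions = []
    for token, logits in zip(result["sequence_tokens"], result["likelihood"]):
        probs = softmax(logits)
        ranked = sorted(zip(vocab, probs), key=lambda tp: tp[1], reverse=True)
        ranked_positions.append((token, ranked[:k]))
    return ranked_positions


# Made-up values for illustration; not real model output.
toy_result = {
    "sequence_tokens": ["E", "V"],
    "vocab_tokens": ["A", "E", "V"],
    "likelihood": [[0.0, 2.0, 1.0], [0.0, 0.0, 3.0]],
}

for observed, candidates in top_tokens(toy_result, k=2):
    print(observed, [(tok, round(p, 3)) for tok, p in candidates])
```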
Encode
Generate embeddings for input sequences
- POST /api/v3/ablang2/encode/
Encode endpoint for AbLang-2.
- Request Headers:
Content-Type – application/json
Authorization – Token YOUR_API_KEY
Request
params (object, optional) — Configuration parameters:
include (string, optional, default: “seqcoding”, enum: [“seqcoding”, “rescoding”]) — Encoding type to compute for each item
align (boolean, optional, default: false) — Alignment flag used when include is “rescoding”
items (array of objects, min: 1, max: 32) — Input antibody sequence pairs:
heavy (string, required, min length: 1, max length: 1024) — Heavy chain amino acid sequence using the extended amino acid alphabet
light (string, required, min length: 1, max length: 1024) — Light chain amino acid sequence using the extended amino acid alphabet
Example request:
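A minimal sketch of a request body for this endpoint, built and checked in Python. The helper name and the sequence fragments are illustrative. Sending align only when include is "rescoding" is a choice made in this sketch, based on the parameter description above, not a documented requirement.

```python
import json

ALLOWED_INCLUDE = ("seqcoding", "rescoding")


def build_encode_payload(pairs, include="seqcoding", align=False):
    """Return the JSON-serializable body for POST /api/v3/ablang2/encode/."""
    if include not in ALLOWED_INCLUDE:
        raise ValueError(f"include must be one of {ALLOWED_INCLUDE}")
    if not 1 <= len(pairs) <= 32:
        raise ValueError("items must contain between 1 and 32 antibody pairs")
    params = {"include": include}
    if include == "rescoding":
        # The align flag is described as applying to rescoding only.
        params["align"] = bool(align)
    return {
        "params": params,
        "items": [{"heavy": heavy, "light": light} for heavy, light in pairs],
    }


# Illustrative fragments only; real inputs would be full VH/VL sequences.
payload = build_encode_payload(
    [("EVQLVESGGGLVQPGGSLRLSCAAS", "DIQMTQSPSSLSASVGDRVTITC")],
    include="rescoding",
    align=True,
)
print(json.dumps(payload, indent=2))
```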
- Status Codes:
200 OK – Successful response
400 Bad Request – Invalid input
500 Internal Server Error – Internal server error
Response
results (array of objects) — One result per input item, in the order requested:
seqcoding (array of floats, size: 480) — Fixed-size embedding for the concatenated heavy and light chains
Example response:
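One common use of the seqcoding output is comparing antibodies by cosine similarity. The sketch below assumes a response shaped as described above, with one seqcoding vector per result. The toy vectors are made up and shortened to four values for readability, whereas real vectors have 480.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Made-up, shortened vectors for illustration; not real model output.
toy_response = {
    "results": [
        {"seqcoding": [0.1, 0.3, -0.2, 0.5]},
        {"seqcoding": [0.2, 0.1, -0.1, 0.4]},
    ]
}

first, second = (r["seqcoding"] for r in toy_response["results"])
print(round(cosine_similarity(first, second), 3))
```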
Performance
Runs CPU‑only on 2 vCPUs and 4 GB RAM. AbLang‑2 deployments do not use GPUs, which makes them cheaper to operate than BioLM’s larger transformer antibody models (e.g. IgT5 Paired, IgBert Paired).
Uses a compact ESM‑2–derived architecture (12 layers, 480‑dimensional embeddings) and an optimized inference pipeline, giving lower latency for seqcoding and rescoding than the larger antibody encoders at BioLM (e.g. IgT5, IgBert).
Likelihood estimation and sequence restoration complete in milliseconds per VH/VL pair under typical loads, so design loops see lower end‑to‑end latency than with heavier antibody language models that require GPU‑backed inference.
Relative to other antibody‑focused transformers at BioLM, AbLang‑2 is more computationally efficient while keeping competitive embedding quality and stronger modeling of non‑germline mutations for antibody design and optimization tasks.
Applications
Antibody sequence optimization that prioritizes non‑germline mutations frequently observed in affinity‑matured and therapeutic antibodies. Per‑residue likelihoods from the predict endpoint highlight positions where mutations away from germline remain probable. This is useful for proposing sequence changes beyond simple germline reversion, but it is not a substitute for explicit binding or developability prediction models.
In silico ranking of mutated antibody panels by comparing per‑residue likelihood profiles across VH/VL variants that differ at a limited number of positions (typically outside CDR3), enabling triage of alternative lead designs before experimental affinity maturation; less informative for variants involving major insertions/deletions, large structural rearrangements, or sequences far outside natural antibody space.
Computational assessment of somatic hypermutation patterns within clonotypes by embedding paired VH/VL sequences with the encode endpoint (seqcoding or rescoding output) and comparing non‑germline‑like positions, supporting analysis of maturation trajectories and mutational hotspots in discovery or patient repertoires; not intended as a standalone phylogenetic or lineage reconstruction tool.
Completion and restoration of partially known antibody variable regions by supplying “*” placeholders in VH/VL sequences to the restore/generate endpoint, enabling reconstruction of paired sequences from repertoire or display data with internal gaps but preserved framework context; performance degrades for highly fragmented inputs, unusual numbering, or non‑canonical antibodies.
Rapid repertoire‑scale embedding of paired antibodies for clustering and diversity analyses in discovery campaigns, using encoder outputs (seqcoding for sequence‑level encodings or rescoding for residue‑level encodings) to group similar VH/VL pairs and quantify diversification from germline; suitable for selecting representative leads from large BCR‑seq datasets, but not as a standalone predictor of aggregation, immunogenicity, or manufacturability.
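The variant-ranking application above can be sketched as follows. This assumes each variant has already been scored by the predict endpoint and that its result has the documented likelihood, sequence_tokens, and vocab_tokens fields. The score used here, the mean log-probability of the observed residues without masking, is one simple choice for illustration and not necessarily the scoring BioLM uses. All values are made up.

```python
import math

AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def log_softmax(row):
    """Numerically stable log-probabilities for one row of logits."""
    peak = max(row)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in row))
    return [x - log_z for x in row]


def mean_log_prob(result):
    """Average log-probability assigned to each observed amino acid."""
    column = {tok: i for i, tok in enumerate(result["vocab_tokens"])}
    scores = []
    for tok, logits in zip(result["sequence_tokens"], result["likelihood"]):
        # Skip special and separator tokens; score standard residues only.
        if tok in AMINO_ACIDS and tok in column:
            scores.append(log_softmax(logits)[column[tok]])
    if not scores:
        raise ValueError("no scorable residues in result")
    return sum(scores) / len(scores)


def rank_variants(named_results):
    """Sort (name, result) pairs from most to least plausible."""
    return sorted(named_results, key=lambda nr: mean_log_prob(nr[1]), reverse=True)


# Made-up results for two single-residue toy variants.
toy = [
    ("variant_B", {"vocab_tokens": ["A", "E"], "sequence_tokens": ["E"],
                   "likelihood": [[2.0, 0.0]]}),
    ("variant_A", {"vocab_tokens": ["A", "E"], "sequence_tokens": ["A"],
                   "likelihood": [[2.0, 0.0]]}),
]
for name, result in rank_variants(toy):
    print(name, round(mean_log_prob(result), 3))
```

Because the score is an average over positions, it is most meaningful for variants of similar length that differ at a few positions, which matches the scope described in the ranking application above.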
Limitations
Maximum Sequence Length: Each heavy and light chain must be between 1 and 1024 amino acids. Longer chains must be truncated or split before submission.
Batch Size: Each request can include between 1 and 32 antibody pairs in items. Larger datasets should be split across multiple API calls.
Antibody-Specific Embeddings: The seqcoding (per-sequence embedding) and rescoding (per-residue embedding, optionally aligned when align=True) outputs are trained on antibody VH/VL sequences and may not generalize to non-antibody proteins or unrelated biological sequences.
Restore Requirements: The restore/generate functionality only accepts sequences where at least one position in heavy or light is marked as * and all other characters are valid amino acids; sequences without * or with other invalid tokens cannot use this feature.
No Structural Predictions: AbLang-2 does not output 3D structural information (e.g. full antibody structures or CDR loop conformations). Use dedicated structure prediction tools for such tasks.
No Binding or Epitope Predictions: Likelihood scores (from likelihood/predict) reflect sequence plausibility under the model and are not direct measures of antigen binding affinity, specificity, developability, or epitope/paratope location; these tasks require additional models or experimental validation.
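The length and batch-size limits can be enforced on the client before any request is made. The sketch below is one way to do that; the helper names are illustrative. It checks only sequence length and the presence of a * placeholder, because this section does not list the exact characters of the extended amino acid alphabet.

```python
MAX_BATCH = 32    # maximum antibody pairs per request
MAX_LENGTH = 1024  # maximum residues per chain


def check_pair(heavy, light):
    """Raise ValueError if either chain violates the documented length limits."""
    for name, seq in (("heavy", heavy), ("light", light)):
        if not 1 <= len(seq) <= MAX_LENGTH:
            raise ValueError(
                f"{name} chain has length {len(seq)}; expected 1 to {MAX_LENGTH}"
            )


def has_restore_placeholder(heavy, light):
    """True if at least one position in either chain is marked with '*'."""
    return "*" in heavy or "*" in light


def chunk_pairs(pairs, size=MAX_BATCH):
    """Yield successive batches of at most `size` validated (heavy, light) pairs."""
    for heavy, light in pairs:
        check_pair(heavy, light)
    for start in range(0, len(pairs), size):
        yield pairs[start:start + size]


# Illustrative fragments repeated to form a dataset larger than one batch.
dataset = [("EVQLVESGGGLVQPGGSLRLSCAAS", "DIQMTQSPSSLSASVGDRVTITC")] * 70
print([len(batch) for batch in chunk_pairs(dataset)])
```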
How We Use It
BioLM uses AbLang-2 as a component in antibody design cycles to score VH–VL sequence pairs with reduced germline bias, highlight non-germline positions that are still plausible, and provide alternative residue suggestions that are more likely to yield productive affinity or liability shifts without destabilizing the scaffold. AbLang-2 sequence- and residue-level encodings, along with per-position likelihoods exposed via standardized, scalable APIs, are combined with structural modeling, developability screens, and experimental data to focus campaigns on non-germline modifications that preserve overall antibody integrity.
Supports in silico affinity maturation by ranking mutations outside CDR3 using sequence likelihoods
Integrates with structure- and biophysics-based tools for multi-parameter antibody optimization across large variant libraries
References
Olsen, T. H., Moal, I. H., & Deane, C. M. (2024). Addressing the antibody germline bias and its effect on language models for improved antibody design. bioRxiv.
Olsen, T. H., Moal, I. H., & Deane, C. M. (2022). AbLang: an antibody language model for completing antibody sequences. Bioinformatics Advances.
Olsen, T. H., Boyles, F., & Deane, C. M. (2022). Observed Antibody Space: A diverse database of cleaned, annotated, and translated unpaired and paired antibody sequences. Protein Science.
Raybould, M. I. J., Marks, C., Lewis, A. P., Shi, J., Bujotzek, A., Taddese, B., & Deane, C. M. (2020). Thera-SAbDab: the Therapeutic Structural Antibody Database. Nucleic Acids Research.
Lin, Z., Akin, H., Rao, R., Hie, B., Zhu, Z., Lu, W., Smetanin, N., Verkuil, R., Kabeli, O., Shmueli, Y., dos Santos Costa, A., Fazel-Zarandi, M., Sercu, T., Candido, S., & Rives, A. (2023). Evolutionary-scale prediction of atomic-level protein structure with a language model. Science.
Ruffolo, J. A., Gray, J. J., & Sulam, J. (2021). Deciphering antibody affinity maturation with language models and weakly supervised learning. arXiv preprint.
Prihoda, D., Maamary, J., Waight, A., Juan, V., Fayadat-Dilman, L., Svozil, D., & Bitton, D. A. (2022). BioPhi: A platform for antibody design, humanization, and humanness evaluation based on natural antibody repertoires and deep learning. mAbs.
Rives, A., Meier, J., Sercu, T., Goyal, S., Lin, Z., Liu, J., Guo, D., Ott, M., Zitnick, C. L., Ma, J., & Fergus, R. (2021). Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proceedings of the National Academy of Sciences.
Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv preprint.
