CLEAN is a GPU-accelerated enzyme function prediction service that assigns EC numbers from amino acid sequences using contrastive learning over ESM-1b embeddings refined into a 128-dimensional, function-aware space. The API provides two endpoints: a predictor that returns ranked EC predictions with distance-based and GMM-derived confidence scores (configurable top-N and minimum confidence), and an encoder that outputs CLEAN embeddings. Typical applications include large-scale genome and metagenome annotation, pathway design, and enzyme engineering.
Predict
Predict up to 5 EC numbers per sequence with at least 0.2 confidence
- POST /api/v3/clean/predict/
Predict endpoint for CLEAN.
- Request Headers:
Content-Type – application/json
Authorization – Token YOUR_API_KEY
Request
params (object, optional) — Configuration parameters:
max_predictions (int, range: 1-20, default: 10) — Maximum number of EC predictions to return per sequence
min_confidence (float, range: 0.0-1.0, default: 0.05) — Minimum confidence score required to include a prediction in the results
items (array of objects, min: 1, max: 10) — Input sequences:
sequence (string, min length: 1, max length: 1022, required) — Amino acid sequence using the extended allowed alphabet
Example request:
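A minimal sketch of the request described above (up to 5 EC numbers per sequence, minimum confidence 0.2), written in Python with only the standard library. The path, headers, and field names come from this page; the base URL, the helper names build_predict_payload and post_json, and the placeholder sequence are assumptions for illustration only.

```python
import json
import urllib.request

# Assumption: the host is not stated on this page; replace with your actual base URL.
BASE_URL = "https://biolm.ai"
PREDICT_PATH = "/api/v3/clean/predict/"


def build_predict_payload(sequences, max_predictions=5, min_confidence=0.2):
    """Build a predict request body, enforcing the limits documented above."""
    if not 1 <= len(sequences) <= 10:
        raise ValueError("items must contain between 1 and 10 sequences")
    if not 1 <= max_predictions <= 20:
        raise ValueError("max_predictions must be in the range 1-20")
    if not 0.0 <= min_confidence <= 1.0:
        raise ValueError("min_confidence must be in the range 0.0-1.0")
    for seq in sequences:
        if not 1 <= len(seq) <= 1022:
            raise ValueError("each sequence must be 1-1022 characters long")
    return {
        "params": {
            "max_predictions": max_predictions,
            "min_confidence": min_confidence,
        },
        "items": [{"sequence": seq} for seq in sequences],
    }


def post_json(path, payload, api_key):
    """POST a JSON body with the documented headers and return the decoded JSON."""
    request = urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Token {api_key}",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Placeholder sequence for illustration, not a real enzyme.
payload = build_predict_payload(["MKTAYIAKQRQISFVKSHFSRQ"])
# Sending requires network access and a valid key:
# response = post_json(PREDICT_PATH, payload, "YOUR_API_KEY")
```

Validating on the client side mirrors the documented ranges, so malformed batches fail before a request is sent rather than coming back as 400 Bad Request.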
- Status Codes:
200 OK – Successful response
400 Bad Request – Invalid input
500 Internal Server Error – Internal server error
Response
results (array of objects) — One result per input item, in the order requested:
predictions (array of objects) — Predicted EC numbers for the input sequence, ordered by distance (closest first)
ec_number (string) — Predicted EC number (e.g., “3.5.2.6”)
distance (float) — Euclidean distance to the corresponding EC cluster center, unitless
confidence (float, range: 0.0-1.0) — GMM-based confidence score, unitless
Example response:
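A sketch of how a client might read a response with the shape documented above. The response dictionary below is hypothetical: its distance and confidence values are made up to show the structure and are not real model output. The helper name summarize_predictions is also an assumption.

```python
def summarize_predictions(response, min_confidence=0.0):
    """Return, per input item, a list of (ec_number, distance, confidence) tuples."""
    summaries = []
    for result in response["results"]:
        kept = [
            (p["ec_number"], p["distance"], p["confidence"])
            for p in result["predictions"]
            if p["confidence"] >= min_confidence
        ]
        summaries.append(kept)
    return summaries


# Hypothetical response with made-up numbers, shaped like the documented fields.
example_response = {
    "results": [
        {
            "predictions": [
                {"ec_number": "3.5.2.6", "distance": 4.1, "confidence": 0.93},
                {"ec_number": "3.4.16.4", "distance": 6.8, "confidence": 0.21},
            ]
        }
    ]
}

# Keep only predictions at or above a stricter client-side threshold.
summary = summarize_predictions(example_response, min_confidence=0.5)
# Predictions are ordered closest first, so the first kept entry is the top call.
best_ec = summary[0][0][0] if summary[0] else None
```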
Encode
Generate 128-dimensional CLEAN embeddings for two enzyme sequences
- POST /api/v3/clean/encode/
Encode endpoint for CLEAN.
- Request Headers:
Content-Type – application/json
Authorization – Token YOUR_API_KEY
Request
items (array of objects, min: 1, max: 10, required) — Input sequences:
sequence (string, min length: 1, max length: 1022, required) — Amino acid sequence using extended valid residue codes
Example request:
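A minimal sketch of an encode request for two sequences, in Python with only the standard library. The path, headers, and items field come from this page; the base URL, the helper name build_encode_request, and both placeholder sequences are assumptions for illustration only.

```python
import json
import urllib.request

# Assumption: the host is not stated on this page; replace with your actual base URL.
BASE_URL = "https://biolm.ai"
ENCODE_PATH = "/api/v3/clean/encode/"


def build_encode_request(sequences, api_key):
    """Build (but do not send) a POST request for the encode endpoint."""
    if not 1 <= len(sequences) <= 10:
        raise ValueError("items must contain between 1 and 10 sequences")
    for seq in sequences:
        if not 1 <= len(seq) <= 1022:
            raise ValueError("each sequence must be 1-1022 characters long")
    body = {"items": [{"sequence": seq} for seq in sequences]}
    return urllib.request.Request(
        BASE_URL + ENCODE_PATH,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Token {api_key}",
        },
        method="POST",
    )


# Placeholder sequences for illustration, not real enzymes.
request = build_encode_request(
    ["MKTAYIAKQRQISFVKSHFSRQ", "MSLNFLDFEQPIAELEAKIDSL"],
    "YOUR_API_KEY",
)
# Sending requires network access and a valid key:
# with urllib.request.urlopen(request) as response:
#     result = json.load(response)
```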
- Status Codes:
200 OK – Successful response
400 Bad Request – Invalid input
500 Internal Server Error – Internal server error
Response
results (array of objects) — One result per input item, in the order requested:
embedding (array of floats, size: 128) — 128-dimensional CLEAN embedding for the input sequence
Example response:
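A sketch of consuming the encode response: extract the embeddings, check the documented size of 128, and compare two sequences by Euclidean distance, the metric the embedding space is described as using. The response below is a hypothetical placeholder with constant vectors, not real model output, and the helper name embeddings_from_response is an assumption.

```python
import math


def embeddings_from_response(response, expected_dim=128):
    """Extract one embedding per result and verify its length."""
    embeddings = [result["embedding"] for result in response["results"]]
    for embedding in embeddings:
        if len(embedding) != expected_dim:
            raise ValueError(f"expected {expected_dim} values, got {len(embedding)}")
    return embeddings


# Hypothetical placeholder vectors shaped like the documented response.
example_response = {
    "results": [
        {"embedding": [0.0] * 128},
        {"embedding": [0.5] * 128},
    ]
}

first, second = embeddings_from_response(example_response)
# math.dist computes the Euclidean distance between two equal-length points.
distance = math.dist(first, second)
```

Smaller distances between embeddings suggest more similar EC function, so the same calculation can back a nearest-neighbour lookup over a library of encoded sequences.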
Performance
Predictive accuracy on held-out, low-identity data: on UniProt-derived tests where all sequences share <50% identity with training data, CLEAN reaches F1 ≈ 0.865 using its maximum-separation EC selection; even at a 10% identity threshold, F1 ≈ 0.67, indicating robustness beyond homology-based transfer
Benchmark vs. other EC predictors: on the New-392 dataset (392 enzymes, 177 ECs, post-training Swiss-Prot release), CLEAN attains F1 ≈ 0.499 (precision ≈ 0.597, recall ≈ 0.481), outperforming ProteInfer (F1 ≈ 0.309) and DeepEC (F1 ≈ 0.230) under comparable evaluation; on the challenging misannotation dataset Price-149, CLEAN reaches F1 ≈ 0.495 vs. ProteInfer ≈ 0.166 and DeepEC ≈ 0.085
Performance on rare and understudied EC numbers: on a curated validation set where each EC appears ≤5 times in training (>3000 samples, >1000 ECs), CLEAN achieves F1 ≈ 0.817, exceeding ProteInfer and DeepEC even though those models were originally trained on this set; EC-level precision and recall remain high across frequency bins, with substantially less bias toward common ECs than multilabel classification approaches
Comparison to other BioLM models and similarity tools:
Versus general protein encoders (e.g., ESM-1b/ESM-2 available on BioLM), CLEAN’s contrastive training produces 128D embeddings where Euclidean distance tracks EC functional similarity, yielding markedly higher EC F1 than using raw ESM embeddings with simple classifiers or kNN
Versus sequence-similarity tools (e.g., DIAMOND on BioLM), CLEAN maintains higher precision/recall for remote homologs (<50% and down to <10% identity), promiscuous enzymes, and historically misannotated proteins, while DIAMOND can remain preferable for very high-identity, bulk annotation workloads
Applications
High-confidence EC annotation for novel and low-homology enzyme leads in discovery pipelines, enabling teams to move from metagenomic or proprietary sequencing data to actionable functional hypotheses when BLASTp and homology tools fail or disagree; CLEAN’s contrastive-learning embeddings and EC-cluster distance rankings are optimized to recover plausible EC numbers for understudied and rare functions, but they are not a replacement for wet-lab validation where regulatory or IP decisions depend on definitive mechanism-of-action data.
Automated functional triage of large enzyme libraries (e.g., 10⁴–10⁶ variants) in directed evolution or semi-rational design campaigns, using CLEAN’s ranked EC predictions (with per-EC confidence scores and distances) to filter, cluster, and prioritize sequences likely to retain the desired catalytic activity or exhibit related transformations, thereby reducing experimental screening load. This is most effective when variants remain within a reasonable sequence distance of known enzymes, and less suited to completely de novo, uncharacterized folds.
Detection and curation of misannotated or ambiguous enzymes in proprietary or public sequence collections, where CLEAN’s EC predictions can be compared against existing labels to highlight contradictions and potential corrections, helping companies clean internal knowledgebases, avoid propagating legacy database errors into downstream models, and flag candidates for targeted re-characterization rather than relying solely on automated homology-based annotations.
Identification of candidate promiscuous enzymes with multiple EC activities to expand biocatalyst portfolios, by scanning sequence collections and using CLEAN to propose additional EC numbers with high confidence for single proteins; this supports discovery of versatile catalysts for multi-step chemoenzymatic processes and pathway shortcuts, while still requiring follow-up kinetic and substrate-scope measurements to quantify real-world utility and to downweight low-confidence secondary predictions.
Enzyme selection for pathway design and metabolic engineering workflows, where CLEAN’s EC-aware embeddings (from the encoder endpoint) and ranked EC assignments (from the predictor endpoint) help choose plausible biocatalysts for missing steps in synthetic routes or retrobiosynthesis plans when no close homologs are known, allowing teams to assemble more complete in silico pathways from metagenomic or custom sequence datasets, while recognizing that cofactor specificity, expression feasibility, and process conditions must be evaluated with additional models or experiments beyond EC prediction alone.
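The misannotation-curation use case above can be sketched as a comparison between existing labels and confident predictions. Everything here is illustrative: the helper names, the sequence IDs, the made-up confidence values, and the 0.5 threshold are assumptions, and a flagged entry is a candidate for review, not a confirmed error.

```python
def ec_matches(predicted, existing, level=4):
    """Compare two EC numbers on their first `level` dot-separated fields."""
    return predicted.split(".")[:level] == existing.split(".")[:level]


def flag_disagreements(existing_labels, predictions_by_id, min_confidence=0.5):
    """Return entries whose confident predictions all disagree with the existing label."""
    flagged = []
    for seq_id, existing in existing_labels.items():
        confident = [
            p["ec_number"]
            for p in predictions_by_id.get(seq_id, [])
            if p["confidence"] >= min_confidence
        ]
        # Flag only when there is at least one confident call and none agrees.
        if confident and not any(ec_matches(ec, existing) for ec in confident):
            flagged.append((seq_id, existing, confident))
    return flagged


# Hypothetical labels and predictions with made-up confidence values.
existing_labels = {"seqA": "3.5.2.6", "seqB": "1.1.1.1"}
predictions_by_id = {
    "seqA": [{"ec_number": "3.5.2.6", "confidence": 0.9}],
    "seqB": [
        {"ec_number": "2.7.1.1", "confidence": 0.8},
        {"ec_number": "1.1.1.1", "confidence": 0.1},
    ],
}

flagged = flag_disagreements(existing_labels, predictions_by_id)
```

Lowering level to 3 compares only the first three EC fields, which tolerates disagreements confined to the final serial number.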
Limitations
Maximum sequence length: Each sequence must be a valid amino-acid string of length 1–1022 characters. Longer inputs must be truncated or pre-processed before calling clean; the model is not aware of truncation, and predictions on truncated sequences may lose critical functional context (for example, missing domains or termini).
Batch size and throughput: Each request can contain at most 10 items in items for both CLEANPredictRequest and CLEANEncodeRequest. Large datasets should be split into multiple batched calls; there is no cross-batch state, so users must handle any required aggregation (e.g., consensus across isoforms) themselves.
Prediction scope and confidence: The model only predicts EC numbers that exist in its training set; genuinely novel enzyme functions or EC classes absent or extremely rare in UniProt may be mapped to the closest known EC and appear confident. The max_predictions (1–20, default 10) and min_confidence (0.0–1.0, default 0.05) fields in CLEANPredictRequestParams only filter which entries are returned in predictions; they do not change the underlying model behavior. Users should treat low-confidence predictions (small confidence values or many ECs returned with similar distance) as hypotheses, not annotations.
Not a general-purpose protein predictor: CLEAN is specialized for enzyme EC annotation. It is not optimal for tasks such as protein stability prediction, binding affinity estimation, non-enzymatic functional annotation, or structure prediction; in those cases use task-specific models and, if needed, combine them with CLEAN predictions as one feature among many.
Embedding usage constraints: CLEANEncodeResponse returns a single 128-dimensional embedding per sequence, optimized for EC-related functional similarity. These embeddings are not guaranteed to reflect other properties (e.g., structure, localization, or expression) and may not be ideal as general-purpose protein embeddings. For applications like global protein clustering, visualization, or multitask models beyond enzyme function, consider combining CLEAN embeddings with broader protein language model embeddings.
Label noise and multi-function enzymes: Because CLEAN is trained on curated but imperfect database annotations, systematic errors or inconsistencies in source EC labels can propagate into predictions. While the contrastive framework can help detect promiscuous enzymes and correct some mislabeled cases, the model may still miss rare secondary activities or over-call promiscuity. Experimental validation remains essential before using predicted EC numbers (from predictions[*].ec_number) in high-impact design decisions.
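The length and batch-size limits above can be handled on the client side before any request is made. The sketch below, with the hypothetical helper plan_batches, sets aside sequences outside the 1–1022 range rather than truncating them silently, and groups the rest into batches of at most 10 while keeping each sequence's original index for later aggregation.

```python
def plan_batches(sequences, max_items=10, max_length=1022):
    """Split sequences into request-sized batches and report out-of-range inputs.

    Returns (batches, rejected): batches is a list of lists of
    (original_index, sequence) pairs; rejected lists the original indices of
    sequences that are empty or longer than max_length.
    """
    accepted = []
    rejected = []
    for index, seq in enumerate(sequences):
        if 1 <= len(seq) <= max_length:
            accepted.append((index, seq))
        else:
            rejected.append(index)
    batches = [
        accepted[start:start + max_items]
        for start in range(0, len(accepted), max_items)
    ]
    return batches, rejected


# Placeholder inputs: 23 valid sequences plus one that exceeds the length limit.
sequences = ["A" * 50] * 23 + ["A" * 1500]
batches, rejected = plan_batches(sequences)
batch_sizes = [len(batch) for batch in batches]
```

Reporting rejected indices leaves the truncation decision to the caller, which matters because the model cannot tell that a sequence was cut and may lose functional context.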
How We Use It
CLEAN enables EC-aware enzyme function annotation that integrates directly into BioLM’s protein design workflows, accelerating the path from raw sequence collections to prioritized, experimentally testable candidates. Through scalable, standardized APIs, teams combine CLEAN’s function-aware embeddings and EC predictions with generative design models, structure-based scoring, and developability filters to triage metagenomic or proprietary libraries, refine legacy annotations, and surface understudied or promiscuous activities that unlock new biocatalysis and pathway engineering strategies.
Used with BioLM’s design and stability/biophysics models, CLEAN helps define objective functions and screening criteria so wet-lab work focuses on variants with both desirable activities and manageable liabilities.
In iterative lab-in-the-loop campaigns, CLEAN’s embeddings and multi-EC outputs support multi-round optimization and faster convergence on enzymes suited for metabolic engineering, process development, and discovery programs.
References
Yu, T., Cui, H., Li, J. C., Luo, Y., Jiang, G., & Zhao, H. (2023). Enzyme function prediction using contrastive learning. Science, 379(6639), 1358–1363.
Yu, T., et al. (2023). Enzyme function prediction using contrastive learning (code and data, version 1.0.0). Zenodo.
