DNA Chisel is a Python-based DNA sequence analysis and optimization framework for multi-objective synthetic biology design problems. In this API, it is used as a fast, CPU-only predictor of sequence-level design features, including GC content and GC-content variance, codon adaptation index (CAI), codon usage entropy, rare-codon and methionine frequencies, melting temperature, hairpin score, homopolymer runs, dinucleotide frequencies, motif counts (restriction sites, TATA boxes, tandem repeats, non-unique 6-mers, in-frame stops), nucleotide skew/entropy, and Kozak sequence strength for host-specific expression and manufacturability assessment.
Predict
Predict DNA design features for a single sequence
- POST /api/v3/dna-chisel/predict/
Predict endpoint for DNA Chisel.
- Request Headers:
Content-Type – application/json
Authorization – Token YOUR_API_KEY
Request
params (object, optional) — Configuration parameters:
include (array of strings, default: [“gc_content”, “cai”, “hairpin_score”, “melting_temperature”, “restriction_site_count”, “codon_usage_entropy”, “rare_codon_frequency”, “homopolymer_run_length”, “dinucleotide_frequencies”, “sequence_length”, “tata_box_count”, “non_unique_6mer_count”, “in_frame_stop_codon_count”, “methionine_frequency”, “at_skew”, “gc_skew”, “nucleotide_entropy”, “tandem_repeat_count”, “gc_content_std_dev”, “kozak_sequence_strength”]) — Feature keys to include in the response; each value must be one of the supported feature option strings
species (string, default: “e_coli”) — Species identifier for codon-related features; one of: “e_coli”, “s_cerevisiae”, “h_sapiens”, “c_elegans”, “b_subtilis”, “d_melanogaster”
restriction_enzymes (array of strings, default: [“EcoRI”, “BsaI”], optional) — Restriction enzyme names for site-count features; set to an empty array or null to disable; each value must be a valid enzyme name from the Biopython restriction enzyme database
items (array of objects, min: 1, max: 1) — Input sequences:
sequence (string, min length: 1, required) — DNA sequence containing only unambiguous nucleotides (A, C, G, T)
Example request:
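A minimal sketch of assembling such a request with the Python standard library, using the documented path, headers, and body fields. The host in `BASE_URL` and the helper name `build_predict_request` are placeholders assumed for illustration; only the path and field names come from this page. The request is built but not sent.

```python
import json
import urllib.request

# Assumption: placeholder host; only the endpoint path is documented here.
BASE_URL = "https://api.example.com"
ENDPOINT = "/api/v3/dna-chisel/predict/"


def build_predict_request(sequence, api_key, species="e_coli", include=None):
    """Build (but do not send) a POST request for exactly one sequence."""
    params = {"species": species}
    if include is not None:
        # Restrict the response to the requested feature keys.
        params["include"] = list(include)
    payload = {"params": params, "items": [{"sequence": sequence}]}
    return urllib.request.Request(
        BASE_URL + ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Token {api_key}",
        },
        method="POST",
    )


request = build_predict_request(
    "ATGGCCAAATAA", "YOUR_API_KEY", include=["gc_content", "cai"]
)
# To send it, pass `request` to urllib.request.urlopen and JSON-decode the body.
```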
- Status Codes:
200 OK – Successful response
400 Bad Request – Invalid input
500 Internal Server Error – Internal server error
Response
results (array of objects) — One result per input item, in the order requested:
gc_content (float, optional, range: 0.0–1.0) — Fractional GC content of the sequence
cai (float, optional, range: 0.0–1.0) — Codon Adaptation Index relative to the selected species
hairpin_score (float, optional, ≥ 0.0) — Predicted hairpin formation score
melting_temperature (float, optional, units: °C) — Melting temperature of the sequence
restriction_site_count (object, optional) — Counts of restriction enzyme recognition sites:
{enzyme_name} (int, ≥ 0) — Count for a single restriction enzyme
codon_usage_entropy (float, optional, ≥ 0.0) — Shannon entropy of codon usage distribution
rare_codon_frequency (float, optional, range: 0.0–1.0) — Fractional frequency of rare codons relative to the selected species
homopolymer_run_length (int, optional, ≥ 1) — Length of the longest homopolymer run
dinucleotide_frequencies (object, optional) — Frequencies of all 16 dinucleotide combinations:
{dinucleotide} (float, range: 0.0–1.0) — Fractional frequency of a single dinucleotide
sequence_length (int, optional, ≥ 1) — Sequence length in nucleotides
tata_box_count (int, optional, ≥ 0) — Count of TATA box motifs
non_unique_6mer_count (int, optional, ≥ 0) — Count of 6-mer subsequences occurring more than once
in_frame_stop_codon_count (int, optional, ≥ 0) — Count of in-frame stop codons
methionine_frequency (float, optional, range: 0.0–1.0) — Fractional frequency of methionine codons (ATG)
at_skew (float, optional, range: -1.0–1.0) — AT skew (A−T)/(A+T)
gc_skew (float, optional, range: -1.0–1.0) — GC skew (G−C)/(G+C)
nucleotide_entropy (float, optional, ≥ 0.0) — Shannon entropy of nucleotide distribution
tandem_repeat_count (int, optional, ≥ 0) — Count of tandem repeat sequences
gc_content_std_dev (float, optional, ≥ 0.0) — Standard deviation of GC content across internal sliding windows
kozak_sequence_strength (float, optional, range: 0.0–1.0) — Strength score of the Kozak consensus sequence
Example response:
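As a hedged illustration of the response shape, the sketch below computes a few of the documented features locally and arranges them like one entry of `results`. The values are produced by this snippet, not by the service, and the service's own implementation may differ in detail. The helper name and the convention of returning 0.0 for a skew with a zero denominator are assumptions.

```python
from itertools import groupby


def local_reference_features(sequence):
    """Compute a few documented features locally, keyed like a `results` entry.

    Assumes a non-empty sequence, matching the documented minimum length of 1.
    """
    seq = sequence.upper()
    a, c, g, t = (seq.count(base) for base in "ACGT")
    return {
        "sequence_length": len(seq),
        "gc_content": (g + c) / len(seq),
        # Skews follow the formulas in the field list above.
        # Assumption: report 0.0 when the denominator is zero.
        "at_skew": (a - t) / (a + t) if (a + t) else 0.0,
        "gc_skew": (g - c) / (g + c) if (g + c) else 0.0,
        # Longest run of one repeated nucleotide.
        "homopolymer_run_length": max(len(list(run)) for _, run in groupby(seq)),
    }


# One result per input item, mirroring the documented top-level `results` array.
response_like = {"results": [local_reference_features("ATGGCCAAATAA")]}
```

Features that depend on species tables or enzyme databases (for example `cai` or `restriction_site_count`) are left out because they cannot be reproduced from the formulas on this page alone.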
Performance
Runs on CPU-only resources (0.25 vCPU, 1 GB RAM); no GPU is used or required because all features are computed with lightweight, local sequence operations.
DNA Chisel itself uses constraint-first, objective-second local optimization heuristics, which converge faster on comparable DNA design problems than generic genetic algorithms and earlier multi-objective tools such as D-Tailor; this endpoint does not run the optimizer and only evaluates sequence features.
For codon- and restriction-related features, performance is largely species-independent (e.g., E. coli, S. cerevisiae, H. sapiens), as codon usage tables and restriction enzyme definitions are cached in memory instead of recomputed per request.
Within BioLM’s DNA tooling, DNA Chisel is optimized for rapid local DNA feature evaluation and constraint handling, making it substantially faster and cheaper for DNA design analytics than heavier protein-structure or genome-scale models that require GPU acceleration.
Applications
Codon optimization assessment for heterologous protein expression using CAI, rare-codon frequency, codon-usage entropy and GC content features to quantify how well a coding sequence matches host-specific usage in supported species such as E. coli, yeast or human; useful for biopharma and industrial protein production teams when triaging candidate designs prior to wet-lab testing; not a full de novo sequence optimizer or predictor of expression level or folding.
Detection of problematic motifs in synthetic constructs via restriction site counts, homopolymer run length, tandem repeat count and non-unique 6-mer metrics, helping gene synthesis providers and synthetic biology companies identify sequences likely to cause synthesis failures or undesired recombination; not intended for designing cloning strategies or genome-scale editing plans.
Evaluation of DNA manufacturability and basic stability constraints for viral vectors and plasmids by combining GC content, GC/AT skew, GC content standard deviation and melting temperature features, enabling gene therapy and vector engineering groups to flag designs that may be hard to synthesize or amplify; less suitable for modeling in vivo vector performance or immunogenicity.
Screening of genetic circuit components and “neutral” sequence spacers using hairpin score, TATA box count, Kozak sequence strength and in-frame stop codon count to reduce unintended transcriptional starts, translation initiation or premature stops; valuable for metabolic engineering and synthetic biology teams designing predictable multi-gene constructs; not designed for detailed RNA secondary structure prediction.
Programmatic quality control of large DNA libraries and variant panels by computing sequence length, dinucleotide frequencies, nucleotide entropy and methionine frequency across many constructs, ensuring diversity while staying within experimental constraints for high-throughput screening; not a replacement for specialized antibody or protein structure-based design tools.
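The triage and screening uses above can be sketched as a simple threshold check over one returned result. This is an illustration only: `flag_result` is a hypothetical helper, the limits are arbitrary placeholders and not recommendations, and the input dictionary is made up to show the call.

```python
def flag_result(result, limits):
    """Return the feature names in `result` that fall outside (low, high) limits."""
    flags = []
    for feature, (low, high) in limits.items():
        value = result.get(feature)
        if value is None:
            # Feature was not requested via params.include; nothing to check.
            continue
        if not (low <= value <= high):
            flags.append(feature)
    return flags


# Placeholder limits chosen only for illustration; tune them per project.
EXAMPLE_LIMITS = {
    "gc_content": (0.30, 0.70),
    "homopolymer_run_length": (1, 6),
    "in_frame_stop_codon_count": (0, 1),
}

# Hypothetical result values, not output from the service.
flags = flag_result({"gc_content": 0.82, "homopolymer_run_length": 4}, EXAMPLE_LIMITS)
```

Because the endpoint accepts one sequence per request, a library-scale check would call the endpoint once per construct and apply a function like this to each result.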
Limitations
Batch Size: The predictor endpoint processes exactly one sequence per request (DnaChiselParams.batch_size = 1); sending multiple items in a single DnaChiselPredictRequest is not supported.
Minimum Sequence Length: Each items[].sequence must contain at least 1 unambiguous DNA nucleotide (A, C, G, T); ambiguous bases are rejected by validate_dna_unambiguous. Very long sequences are accepted but will increase runtime and may be impractical to analyze at scale.
Feature Scope Only: This API reports predefined sequence descriptors in params.include (for example gc_content, cai, melting_temperature) and never alters items[].sequence. It does not expose DNA Chisel’s sequence optimization or design capabilities described in the original publication.
Codon and Species Limitations: Codon-related outputs such as cai, codon_usage_entropy and rare_codon_frequency are computed only for params.species values in SupportedSpecies (e_coli, s_cerevisiae, h_sapiens, c_elegans, b_subtilis, d_melanogaster). Using other organisms requires mapping to one of these codon tables and may not reflect true in vivo behavior.
Enzyme and Motif Coverage: restriction_site_count is limited to enzymes returned by list_supported_restriction_enzymes; any unsupported name in params.restriction_enzymes raises a validation error. Motif-derived counts such as tata_box_count and tandem_repeat_count rely on simple pattern-based heuristics and are not substitutes for full promoter, regulatory-element, or repeat-annotation pipelines.
No Higher-Level Functional Predictions: Outputs like gc_content, hairpin_score, kozak_sequence_strength, or tandem_repeat_count are low-level sequence metrics only. The API does not predict protein structure, expression levels, folding, activity, stability, or other cellular or organismal phenotypes, and is not suitable as a standalone design or validation tool for complex genetic constructs.
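The input constraints above can be checked on the client before a request is sent. The sketch below is a hypothetical pre-flight check, not the service's own validator: the function name is invented, and treating lowercase bases as invalid is an assumption, since this page does not say how the server handles case.

```python
SUPPORTED_SPECIES = {
    "e_coli", "s_cerevisiae", "h_sapiens",
    "c_elegans", "b_subtilis", "d_melanogaster",
}


def check_payload(payload):
    """Return a list of problems; an empty list means the documented limits are met."""
    problems = []
    items = payload.get("items", [])
    if len(items) != 1:
        # The endpoint processes exactly one sequence per request.
        problems.append("items must contain exactly one entry")
    for item in items:
        sequence = item.get("sequence", "")
        if not sequence:
            problems.append("sequence must not be empty")
        elif set(sequence) - set("ACGT"):
            # Assumption: only uppercase A, C, G, T are treated as valid here.
            problems.append("sequence must contain only A, C, G, T")
    # params is optional; species defaults to e_coli per the request schema.
    species = (payload.get("params") or {}).get("species", "e_coli")
    if species not in SUPPORTED_SPECIES:
        problems.append(f"unsupported species: {species}")
    return problems
```

Enzyme names are not checked here because the set of supported enzymes is only available from the service.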
How We Use It
DNA Chisel enables us to score DNA sequences for manufacturability and basic design risks before synthesis, providing standardized sequence-quality metrics that feed into protein engineering and lab-in-the-loop optimization. By turning GC content, codon usage statistics, restriction site counts, sequence complexity, and related features into structured outputs, we link DNA Chisel analyses with sequence-level predictors, structure-based property models, and generative design loops to de-risk synthesis and focus wet-lab effort on candidates with robust DNA-level properties.
Integrates with predictive models and embeddings to rank or filter designs using DNA designability metrics alongside protein-level scores.
Supports multi-round optimization by supplying consistent, synthesis-aware sequence features for each candidate across design cycles.
References
Zulkower, V., & Rosser, S. (2020). DNA Chisel, a versatile sequence optimizer. Bioinformatics, 36(16), 4508–4509. https://doi.org/10.1093/bioinformatics/btaa558
