DeepCodon: A rare-codon–aware AI tool boosts protein expression in E. coli
Nanjing Agricultural University The Academy of Science
By combining Transformer-based sequence modeling with a novel conditional probability strategy, DeepCodon overcomes the long-standing trade-off between maximizing expression metrics and maintaining the translational features critical for proper protein folding.
Codon optimization is a central strategy in recombinant protein production, enabling genes from one organism to be efficiently expressed in another. Conventional methods typically replace low-frequency codons with host-preferred synonymous codons, which can boost expression but may also disturb translation kinetics, disrupt co-translational folding, and yield insoluble or inactive proteins. To mitigate these risks, heuristic metrics such as CAI, tAI, GC content, and mRNA folding energy have been widely used, yet they rely heavily on expert rules and struggle to capture the context-dependent complexity of codon usage. Recently, deep learning approaches have emerged, learning codon patterns directly from large datasets. Although Transformer models outperform earlier neural networks, most existing tools target eukaryotes or multi-species data, leaving E. coli—the most common bacterial host—without a dedicated Transformer-based optimizer.
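To make two of the heuristic metrics named above concrete, the following is a minimal Python sketch of GC content and the codon adaptation index (CAI), computed as the geometric mean of each codon's usage relative to the most frequent synonymous codon. The usage counts in `TOY_USAGE` are hypothetical illustration values, not a real E. coli codon table, and the function names are assumptions made for this example rather than part of DeepCodon.

```python
from math import exp, log

# Hypothetical toy usage counts for three synonymous codon families.
# Illustrative numbers only; NOT a real E. coli codon usage table.
TOY_USAGE = {
    "L": {"CTG": 50, "TTA": 13, "CTA": 4},
    "K": {"AAA": 34, "AAG": 11},
    "R": {"CGT": 21, "AGG": 2},
}


def split_codons(cds):
    """Split a coding sequence into consecutive 3-nucleotide codons."""
    if len(cds) % 3 != 0:
        raise ValueError("CDS length must be a multiple of 3")
    return [cds[i:i + 3] for i in range(0, len(cds), 3)]


def gc_content(cds):
    """Fraction of bases in the sequence that are G or C."""
    return sum(base in "GC" for base in cds) / len(cds)


def relative_adaptiveness(usage):
    """Weight of each codon: its count divided by the top count in its family."""
    weights = {}
    for family in usage.values():
        top = max(family.values())
        for codon, count in family.items():
            weights[codon] = count / top
    return weights


def cai(cds, weights):
    """Geometric mean of the weights of all codons that have a weight."""
    logs = [log(weights[c]) for c in split_codons(cds) if c in weights]
    if not logs:
        raise ValueError("no scorable codons in sequence")
    return exp(sum(logs) / len(logs))


WEIGHTS = relative_adaptiveness(TOY_USAGE)
native = "CTAAAGAGG"     # Leu-Lys-Arg encoded with low-frequency codons
optimized = "CTGAAACGT"  # same peptide encoded with the top codon of each family
print(cai(native, WEIGHTS), cai(optimized, WEIGHTS))
print(gc_content(native), gc_content(optimized))
```

In this toy case, replacing every codon with the most frequent synonym drives CAI to its maximum, which is exactly the kind of over-optimization that can remove slow-translating codons a protein may need for folding.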
A study (DOI: 10.1016/j.bidere.2025.100042) published in BioDesign Research on 12 August 2025 by Huifeng Jiang’s team, Tianjin Institute of Industrial Biotechnology, Chinese Academy of Sciences, demonstrates that integrating deep learning with conserved rare codon preservation enables more biologically informed codon optimization, leading to more reliable and effective heterologous protein expression in E. coli.
To develop and validate a deep learning–based codon optimization framework, the researchers first constructed a large-scale protein–CDS translation model using a Transformer architecture trained on approximately 1.5 million nonredundant Enterobacteriaceae coding sequences. They then fine-tuned it on 67,860 highly expressed genes selected based on CAI, tAI, and GC content, allowing the model to implicitly learn high-expression–associated features such as codon bias, GC patterns, and conserved motifs. The resulting fine-tuned model, DeepCodon-FT, was then systematically evaluated against existing tools by optimizing 3,000 test proteins and comparing multiple metrics, including sequence identity, GC content, CAI, tAI, local %MinMax profiles, and the preservation of conserved rare codon clusters. DeepCodon-FT showed significantly higher sequence similarity to native genes, maintained synthesis-friendly GC content ranges, and achieved balanced improvements in CAI and tAI without excessive codon over-optimization that could impair protein folding.
To explicitly address rare codon preservation, the researchers constructed a RareCodon dataset by identifying conserved rare codon clusters across 2 million homologous gene groups spanning nearly 100,000 species and analyzed their positional and structural distributions using AlphaFold2 predictions. Building on this analysis, DeepCodon incorporated a conditional autoregressive strategy to guide optimization while protecting these clusters. Benchmarking on the RareCodon dataset revealed that DeepCodon retained approximately 90% of conserved rare codon clusters, far exceeding other methods.
Experimental validation further confirmed these advantages: when seven poorly expressed cytochrome P450 enzymes and thirteen AI-designed G3PDH enzymes were optimized, synthesized, and expressed in E. coli, DeepCodon outperformed a commercial tool in nine cases based on quantitative protein expression measurements, with most remaining cases showing comparable performance. Together, these results demonstrate that integrating large-scale training, rare codon conservation, and experimental validation enables DeepCodon to achieve robust and biologically informed codon optimization.
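The idea of optimizing a sequence while protecting conserved rare codons, and then measuring how many of them survive, can be illustrated with a small Python sketch. This is not the DeepCodon model: a greedy "pick the preferred codon" rule stands in for the Transformer's conditional autoregressive decoding, retention is counted per codon position rather than per cluster, and the tables `PREFERRED` and `CODON_TO_AA`, the function names, and the example sequence are all hypothetical.

```python
# Hypothetical toy tables covering three amino acids (illustrative only).
PREFERRED = {"L": "CTG", "K": "AAA", "R": "CGT"}
CODON_TO_AA = {
    "CTG": "L", "CTA": "L", "TTA": "L",
    "AAA": "K", "AAG": "K",
    "CGT": "R", "AGG": "R",
}


def split_codons(cds):
    """Split a coding sequence into consecutive 3-nucleotide codons."""
    if len(cds) % 3 != 0:
        raise ValueError("CDS length must be a multiple of 3")
    return [cds[i:i + 3] for i in range(0, len(cds), 3)]


def recode(native_cds, protected):
    """Left-to-right recoding that keeps the native codon at protected indices.

    A greedy stand-in for a learned decoder: every unprotected position
    receives the preferred synonymous codon for its amino acid.
    """
    out = []
    for i, codon in enumerate(split_codons(native_cds)):
        if i in protected:
            out.append(codon)  # preserve the conserved rare codon
        else:
            out.append(PREFERRED[CODON_TO_AA[codon]])
    return "".join(out)


def retention(native_cds, optimized_cds, protected):
    """Fraction of protected codon positions left unchanged by optimization."""
    if not protected:
        raise ValueError("no protected positions given")
    native = split_codons(native_cds)
    optimized = split_codons(optimized_cds)
    kept = sum(native[i] == optimized[i] for i in protected)
    return kept / len(protected)


native = "CTAAAGAGGTTA"  # Leu-Lys-Arg-Leu; codons 2 and 3 assumed conserved
cluster = {2, 3}         # zero-based codon indices of the protected cluster
constrained = recode(native, cluster)
unconstrained = recode(native, set())
print(constrained, retention(native, constrained, cluster))
print(unconstrained, retention(native, unconstrained, cluster))
```

Both outputs encode the same peptide; only the constrained version keeps the rare codons at the protected positions, which is the property the retention benchmark described above is designed to capture.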
By combining data-driven learning with biologically informed constraints, DeepCodon offers a more nuanced approach to codon optimization. It is particularly valuable for difficult-to-express proteins, enzyme discovery, and synthetic pathway construction, where preserving translational control elements can be as important as maximizing expression levels. The availability of DeepCodon as an online tool lowers barriers to adoption and could accelerate research and industrial applications relying on E. coli expression systems.
###
References
DOI: 10.1016/j.bidere.2025.100042
Original Source URL: https://doi.org/10.1016/j.bidere.2025.100042
Funding information
This project has received funding from the Strategic Priority Research Program of the Chinese Academy of Sciences (XDC0120200), the National Natural Science Foundation of China (32371499, 12326611), the COMSATS Joint Center for Industrial Biotechnology (No. TSBICIP-IJCP-001), the Tianjin Synthetic Biotechnology Innovation Capacity Improvement Project (No. TSBICIP-IJCP-002, TSBICIP-CYFH-011, TSBICIP-KJGG-009-02, TSBICIP-KJGG-008-02, TSBICIP-KJGG-018, TSBICIP-PTJJ-012), and major research projects of the Haihe Laboratory of Synthetic Biology (No. 22HHSWSS00005 and 22HHSWSS00004).
About BioDesign Research
BioDesign Research is dedicated to information exchange in the interdisciplinary field of biosystems design. Its unique mission is to pave the way towards the predictable de novo design and assessment of engineered or reengineered living organisms using rational or automated methods to address global challenges in health, agriculture, and the environment.