Alexey Khoroshilov, Alexey Chernysh, Orkhan Ekhtibarov, Nini Kamkia, Dmitry Zmitrovich
Lime
QuantCode-Bench evaluates modern LLMs on their ability to generate executable algorithmic trading strategies for the Backtrader framework from natural-language descriptions. The benchmark contains 400 tasks of varying difficulty collected from Reddit, TradingView, StackExchange, GitHub, and synthetic sources. Models are evaluated through a four-stage pipeline in both single-turn and agentic multi-turn settings.
A strategy is counted as successful only if it passes all four stages sequentially.
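To make the target artifact concrete, below is a minimal sketch of the kind of Backtrader strategy the benchmark asks models to produce. The class name, the moving-average crossover rule, and the parameter values are illustrative only and are not taken from any benchmark task.

```python
import backtrader as bt


class SmaCross(bt.Strategy):
    """Illustrative SMA-crossover strategy in typical Backtrader style."""
    params = dict(fast=10, slow=30)  # example parameters, not from a benchmark task

    def __init__(self):
        fast_sma = bt.ind.SMA(period=self.p.fast)
        slow_sma = bt.ind.SMA(period=self.p.slow)
        self.crossover = bt.ind.CrossOver(fast_sma, slow_sma)

    def next(self):
        if not self.position and self.crossover > 0:
            self.buy()    # fast SMA crossed above slow SMA: open a long position
        elif self.position and self.crossover < 0:
            self.close()  # fast SMA crossed back below slow SMA: exit
```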
400 trading-strategy generation tasks spanning diverse sources and three difficulty levels.
| Source | Easy | Medium | Hard | Total |
|---|---|---|---|---|
| Reddit | 147 | 16 | 20 | 183 |
| TradingView | 6 | 57 | 37 | 100 |
| StackExchange | 32 | 34 | 24 | 90 |
| GitHub | 12 | 1 | 6 | 19 |
| Synthetic | 0 | 8 | 0 | 8 |
| Total | 197 | 116 | 87 | 400 |
In the single-turn setting, the model must generate a correct strategy on the first attempt. Models are sorted by Judge pass rate.
| Model | Compilation, % | Backtest, % | Trade, % | Judge, % |
|---|---|---|---|---|
| claude-opus-4.6 | 100.0 | 98.2 | 77.2 | 75.8 |
| gpt-5.4 | 100.0 | 95.5 | 72.0 | 70.2 |
| claude-sonnet-4.5 | 100.0 | 91.5 | 71.2 | 69.8 |
| gpt-5.2-codex | 100.0 | 94.5 | 74.5 | 67.5 |
| glm-5 | 100.0 | 92.4 | 70.3 | 65.4 |
| claude-sonnet-4.6 | 100.0 | 85.8 | 66.2 | 65.0 |
| kimi-k2.5 | 99.7 | 87.3 | 67.5 | 64.8 |
| gemini-3-flash | 100.0 | 76.0 | 63.2 | 59.8 |
| grok-4.1-fast | 99.2 | 70.2 | 56.1 | 48.9 |
| deepseek-v3.2 | 100.0 | 75.8 | 50.0 | 48.8 |
| qwen3-235b | 100.0 | 72.5 | 49.0 | 48.2 |
| qwen3-coder-30b | 100.0 | 59.0 | 40.5 | 39.2 |
| gemini-2.5-flash | 99.5 | 49.2 | 33.2 | 31.2 |
| qwen3-14b | 98.0 | 42.2 | 27.8 | 25.2 |
| qwen3-8b | 99.5 | 31.8 | 19.8 | 18.5 |
| qwen3-4b | 98.8 | 24.6 | 16.4 | 12.3 |
| qwen3-1.7b | 98.1 | 23.1 | 13.7 | 7.8 |
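The four leaderboard columns correspond to the sequential stages of the evaluation pipeline. The sketch below shows how such a staged check could be wired together around Backtrader; it is a minimal illustration under assumed interfaces, and the function names (`evaluate_strategy`, `llm_judge`), the exact pass criteria, and the feed handling are placeholders rather than the benchmark's actual harness.

```python
import backtrader as bt


def llm_judge(source_code: str) -> bool:
    """Placeholder for the LLM-based semantic check; always passes in this stub."""
    return True


def evaluate_strategy(source_code: str, data_feed) -> dict:
    """Run the four stages in order; a later stage only runs if all earlier ones pass."""
    result = {"compile": False, "backtest": False, "trade": False, "judge": False}

    # Stage 1: Compilation -- the generated code must parse, execute on import,
    # and define a Strategy subclass.
    namespace: dict = {}
    try:
        exec(compile(source_code, "<strategy>", "exec"), namespace)
        strategy_cls = next(
            obj for obj in namespace.values()
            if isinstance(obj, type) and issubclass(obj, bt.Strategy) and obj is not bt.Strategy
        )
        result["compile"] = True
    except Exception:
        return result

    # Stage 2: Backtest -- the strategy must run end to end without raising.
    cerebro = bt.Cerebro()
    cerebro.adddata(data_feed)
    cerebro.addstrategy(strategy_cls)
    cerebro.addanalyzer(bt.analyzers.TradeAnalyzer, _name="trades")
    try:
        runs = cerebro.run()
        result["backtest"] = True
    except Exception:
        return result

    # Stage 3: Trade -- at least one trade must actually have been opened.
    trade_stats = runs[0].analyzers.trades.get_analysis()
    if trade_stats.get("total", {}).get("total", 0) == 0:
        return result
    result["trade"] = True

    # Stage 4: Judge -- semantic check that the code matches the task description.
    result["judge"] = llm_judge(source_code)
    return result
```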
In the agentic setting, the model receives structured feedback and may repair errors across up to 10 turns. Avg. turns is the average number of turns used; T1, T3, T5, and T10 report the Judge pass rate within 1, 3, 5, and 10 turns. Models are sorted by Judge pass rate; a sketch of such a repair loop follows the table.
| Model | Compilation, % | Backtest, % | Trade, % | Judge, % | Avg. turns | T1 | T3 | T5 | T10 |
|---|---|---|---|---|---|---|---|---|---|
| claude-opus-4.6 | 100.0 | 100.0 | 100.0 | 97.5 | 1.5 | 75.8 | 95.2 | 97.5 | 97.5 |
| claude-sonnet-4.6 | 100.0 | 99.8 | 99.8 | 96.0 | 2.0 | 65.0 | 90.2 | 93.8 | 96.0 |
| gpt-5.4 | 100.0 | 99.8 | 98.2 | 95.0 | 1.9 | 70.2 | 91.5 | 93.2 | 95.0 |
| kimi-k2.5 | 100.0 | 100.0 | 98.8 | 93.5 | 2.3 | 64.8 | 84.2 | 89.2 | 93.5 |
| claude-sonnet-4.5 | 100.0 | 100.0 | 99.5 | 93.0 | 2.0 | 69.8 | 90.0 | 91.2 | 93.0 |
| gemini-3-flash | 100.0 | 97.5 | 94.5 | 91.8 | 2.4 | 59.8 | 83.5 | 88.2 | 91.8 |
| glm-5 | 100.0 | 99.5 | 95.8 | 90.8 | 2.4 | 65.4 | 83.2 | 88.2 | 90.8 |
| gpt-5.2-codex | 100.0 | 100.0 | 99.8 | 89.8 | 2.4 | 67.5 | 84.2 | 88.2 | 89.8 |
| qwen3-235b | 100.0 | 98.2 | 93.8 | 87.2 | 3.1 | 48.2 | 74.0 | 81.2 | 87.2 |
| grok-4.1-fast | 100.0 | 97.0 | 92.2 | 84.5 | 3.2 | 48.9 | 74.5 | 79.2 | 84.5 |
| deepseek-v3.2 | 100.0 | 97.2 | 92.0 | 83.8 | 3.1 | 48.8 | 75.5 | 80.0 | 83.8 |
| qwen3-coder-30b | 100.0 | 86.2 | 76.0 | 68.0 | 4.7 | 39.2 | 57.0 | 61.8 | 68.0 |
| qwen3-14b | 100.0 | 83.2 | 67.0 | 62.7 | 5.3 | 25.2 | 51.0 | 56.0 | 62.7 |
| gemini-2.5-flash | 100.0 | 79.8 | 65.2 | 62.5 | 5.2 | 31.2 | 52.0 | 57.8 | 62.5 |
| qwen3-8b | 100.0 | 66.2 | 48.9 | 47.6 | 6.5 | 18.5 | 30.9 | 40.2 | 47.6 |
| qwen3-4b | 98.1 | 52.5 | 41.9 | 31.9 | 7.4 | 12.3 | 17.5 | 25.6 | 31.9 |
| qwen3-1.7b | 100.0 | 35.0 | 26.5 | 14.2 | 9.2 | 7.8 | 8.1 | 9.2 | 14.2 |
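A minimal sketch of the multi-turn repair loop described above, assuming a hypothetical `model.generate_strategy` interface and reusing the `evaluate_strategy` sketch from the previous listing; the feedback format is illustrative and not the benchmark's actual prompt.

```python
MAX_TURNS = 10  # the benchmark allows up to 10 repair turns


def agentic_run(task_description: str, data_feed, model) -> tuple[bool, int]:
    """Generate, evaluate, and repair a strategy until it passes all four stages."""
    feedback = None
    for turn in range(1, MAX_TURNS + 1):
        # Hypothetical model interface: takes the task plus the latest feedback.
        source_code = model.generate_strategy(task_description, feedback)
        stages = evaluate_strategy(source_code, data_feed)  # see the previous sketch
        if all(stages.values()):
            return True, turn  # success and the number of turns used (cf. Avg. turns / T-k)
        # Structured feedback: report the first failed stage and what already passed.
        failed_stage = next(name for name, passed in stages.items() if not passed)
        feedback = (
            f"Stage '{failed_stage}' failed; "
            f"stages passed so far: {[name for name, ok in stages.items() if ok]}"
        )
    return False, MAX_TURNS
```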
@article{khoroshilov2026quantcodebench,
title={QuantCode-Bench: A Benchmark for Evaluating the Ability of Large Language Models to Generate Executable Algorithmic Trading Strategies},
author={Khoroshilov, Alexey and Chernysh, Alexey and Ekhtibarov, Orkhan and Kamkia, Nini and Zmitrovich, Dmitry},
year={2026},
url={https://github.com/LimexAILab/QuantCode-Bench}
}