Alexey Khoroshilov, Alexey Chernysh, Orkhan Ekhtibarov, Nini Kamkia, Dmitry Zmitrovich
Lime
QuantCode-Bench evaluates modern LLMs on their ability to generate executable algorithmic trading strategies for the Backtrader framework from natural-language descriptions. The benchmark contains 400 tasks of varying difficulty collected from Reddit, TradingView, StackExchange, GitHub, and synthetic sources. Models are evaluated through a four-stage pipeline in both single-turn and agentic multi-turn settings.
A strategy is counted as successful only if it passes all four stages sequentially.
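To make the target artifact concrete, below is a minimal sketch of the kind of Backtrader strategy the benchmark asks models to produce. The class name, the moving-average crossover rule, and the parameter values are illustrative only and are not taken from any benchmark task.

```python
import backtrader as bt


class SmaCross(bt.Strategy):
    """Illustrative SMA-crossover strategy in typical Backtrader style."""
    params = dict(fast=10, slow=30)  # example parameters, not from a benchmark task

    def __init__(self):
        fast_sma = bt.ind.SMA(period=self.p.fast)
        slow_sma = bt.ind.SMA(period=self.p.slow)
        self.crossover = bt.ind.CrossOver(fast_sma, slow_sma)

    def next(self):
        if not self.position and self.crossover > 0:
            self.buy()    # fast SMA crossed above slow SMA: open a long position
        elif self.position and self.crossover < 0:
            self.close()  # fast SMA crossed back below slow SMA: exit
```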
400 trading-strategy generation tasks spanning diverse sources and three difficulty levels.
| Source | Easy | Medium | Hard | Total |
|---|---|---|---|---|
| Reddit | 147 | 16 | 20 | 183 |
| TradingView | 6 | 57 | 37 | 100 |
| StackExchange | 32 | 34 | 24 | 90 |
| GitHub | 12 | 1 | 6 | 19 |
| Synthetic | 0 | 8 | 0 | 8 |
| Total | 197 | 116 | 87 | 400 |
In the single-turn setting, the model must generate a correct strategy on the first attempt. Models are sorted by Judge pass rate.
| Model | Compilation, % | Backtest, % | Trade, % | Judge, % |
|---|---|---|---|---|
| claude-opus-4.6 | 100.0 | 98.2 | 77.2 | 75.8 |
| gpt-5.4 | 100.0 | 95.5 | 72.0 | 70.2 |
| claude-sonnet-4.5 | 100.0 | 91.5 | 71.2 | 69.8 |
| gpt-5.2-codex | 100.0 | 94.5 | 74.5 | 67.5 |
| glm-5 | 100.0 | 92.4 | 70.3 | 65.4 |
| claude-sonnet-4.6 | 100.0 | 85.8 | 66.2 | 65.0 |
| kimi-k2.5 | 99.7 | 87.3 | 67.5 | 64.8 |
| gemini-3-flash | 100.0 | 76.0 | 63.2 | 59.8 |
| grok-4.1-fast | 99.2 | 70.2 | 56.1 | 48.9 |
| deepseek-v3.2 | 100.0 | 75.8 | 50.0 | 48.8 |
| qwen3-235b | 100.0 | 72.5 | 49.0 | 48.2 |
| qwen3-coder-30b | 100.0 | 59.0 | 40.5 | 39.2 |
| gemini-2.5-flash | 99.5 | 49.2 | 33.2 | 31.2 |
| qwen3-14b | 98.0 | 42.2 | 27.8 | 25.2 |
| qwen3-8b | 99.5 | 31.8 | 19.8 | 18.5 |
| qwen3-4b | 98.8 | 24.6 | 16.4 | 12.3 |
| qwen3-1.7b | 98.1 | 23.1 | 13.7 | 7.8 |
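The four leaderboard columns correspond to the sequential stages of the evaluation pipeline. The sketch below shows how such a staged check could be wired together around Backtrader; it is a minimal illustration under assumed interfaces, and the function names (`evaluate_strategy`, `llm_judge`), the exact pass criteria, and the feed handling are placeholders rather than the benchmark's actual harness.

```python
import backtrader as bt


def llm_judge(source_code: str) -> bool:
    """Placeholder for the LLM-based semantic check; always passes in this stub."""
    return True


def evaluate_strategy(source_code: str, data_feed) -> dict:
    """Run the four stages in order; a later stage only runs if all earlier ones pass."""
    result = {"compile": False, "backtest": False, "trade": False, "judge": False}

    # Stage 1: Compilation -- the generated code must parse, execute on import,
    # and define a Strategy subclass.
    namespace: dict = {}
    try:
        exec(compile(source_code, "<strategy>", "exec"), namespace)
        strategy_cls = next(
            obj for obj in namespace.values()
            if isinstance(obj, type) and issubclass(obj, bt.Strategy) and obj is not bt.Strategy
        )
        result["compile"] = True
    except Exception:
        return result

    # Stage 2: Backtest -- the strategy must run end to end without raising.
    cerebro = bt.Cerebro()
    cerebro.adddata(data_feed)
    cerebro.addstrategy(strategy_cls)
    cerebro.addanalyzer(bt.analyzers.TradeAnalyzer, _name="trades")
    try:
        runs = cerebro.run()
        result["backtest"] = True
    except Exception:
        return result

    # Stage 3: Trade -- at least one trade must actually have been opened.
    trade_stats = runs[0].analyzers.trades.get_analysis()
    if trade_stats.get("total", {}).get("total", 0) == 0:
        return result
    result["trade"] = True

    # Stage 4: Judge -- semantic check that the code matches the task description.
    result["judge"] = llm_judge(source_code)
    return result
```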
In the agentic setting, the model receives structured feedback and may repair errors across up to 10 turns. Avg. turns is the average number of turns used; T1, T3, T5, and T10 report the Judge pass rate within 1, 3, 5, and 10 turns. Models are sorted by Judge pass rate; a sketch of such a repair loop follows the table.
| Model | Compilation, % | Backtest, % | Trade, % | Judge, % | Avg. turns | T1 | T3 | T5 | T10 |
|---|---|---|---|---|---|---|---|---|---|
| claude-opus-4.6 | 100.0 | 100.0 | 100.0 | 97.5 | 1.5 | 75.8 | 95.2 | 97.5 | 97.5 |
| claude-sonnet-4.6 | 100.0 | 99.8 | 99.8 | 96.0 | 2.0 | 65.0 | 90.2 | 93.8 | 96.0 |
| gpt-5.4 | 100.0 | 99.8 | 98.2 | 95.0 | 1.9 | 70.2 | 91.5 | 93.2 | 95.0 |
| kimi-k2.5 | 100.0 | 100.0 | 98.8 | 93.5 | 2.3 | 64.8 | 84.2 | 89.2 | 93.5 |
| claude-sonnet-4.5 | 100.0 | 100.0 | 99.5 | 93.0 | 2.0 | 69.8 | 90.0 | 91.2 | 93.0 |
| gemini-3-flash | 100.0 | 97.5 | 94.5 | 91.8 | 2.4 | 59.8 | 83.5 | 88.2 | 91.8 |
| glm-5 | 100.0 | 99.5 | 95.8 | 90.8 | 2.4 | 65.4 | 83.2 | 88.2 | 90.8 |
| gpt-5.2-codex | 100.0 | 100.0 | 99.8 | 89.8 | 2.4 | 67.5 | 84.2 | 88.2 | 89.8 |
| qwen3-235b | 100.0 | 98.2 | 93.8 | 87.2 | 3.1 | 48.2 | 74.0 | 81.2 | 87.2 |
| grok-4.1-fast | 100.0 | 97.0 | 92.2 | 84.5 | 3.2 | 48.9 | 74.5 | 79.2 | 84.5 |
| deepseek-v3.2 | 100.0 | 97.2 | 92.0 | 83.8 | 3.1 | 48.8 | 75.5 | 80.0 | 83.8 |
| qwen3-coder-30b | 100.0 | 86.2 | 76.0 | 68.0 | 4.7 | 39.2 | 57.0 | 61.8 | 68.0 |
| qwen3-14b | 100.0 | 83.2 | 67.0 | 62.7 | 5.3 | 25.2 | 51.0 | 56.0 | 62.7 |
| gemini-2.5-flash | 100.0 | 79.8 | 65.2 | 62.5 | 5.2 | 31.2 | 52.0 | 57.8 | 62.5 |
| qwen3-8b | 100.0 | 66.2 | 48.9 | 47.6 | 6.5 | 18.5 | 30.9 | 40.2 | 47.6 |
| qwen3-4b | 98.1 | 52.5 | 41.9 | 31.9 | 7.4 | 12.3 | 17.5 | 25.6 | 31.9 |
| qwen3-1.7b | 100.0 | 35.0 | 26.5 | 14.2 | 9.2 | 7.8 | 8.1 | 9.2 | 14.2 |
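A minimal sketch of the multi-turn repair loop described above, assuming a hypothetical `model.generate_strategy` interface and reusing the `evaluate_strategy` sketch from the previous listing; the feedback format is illustrative and not the benchmark's actual prompt.

```python
MAX_TURNS = 10  # the benchmark allows up to 10 repair turns


def agentic_run(task_description: str, data_feed, model) -> tuple[bool, int]:
    """Generate, evaluate, and repair a strategy until it passes all four stages."""
    feedback = None
    for turn in range(1, MAX_TURNS + 1):
        # Hypothetical model interface: takes the task plus the latest feedback.
        source_code = model.generate_strategy(task_description, feedback)
        stages = evaluate_strategy(source_code, data_feed)  # see the previous sketch
        if all(stages.values()):
            return True, turn  # success and the number of turns used (cf. Avg. turns / T-k)
        # Structured feedback: report the first failed stage and what already passed.
        failed_stage = next(name for name, passed in stages.items() if not passed)
        feedback = (
            f"Stage '{failed_stage}' failed; "
            f"stages passed so far: {[name for name, ok in stages.items() if ok]}"
        )
    return False, MAX_TURNS
```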
@article{khoroshilov2026quantcodebench,
title={QuantCode-Bench: A Benchmark for Evaluating the Ability of Large Language Models to Generate Executable Algorithmic Trading Strategies},
author={Khoroshilov, Alexey and Chernysh, Alexey and Ekhtibarov, Orkhan and Kamkia, Nini and Zmitrovich, Dmitry},
year={2026},
url={https://github.com/LimexAILab/QuantCode-Bench}
}