QuantCode-Bench

QuantCode-Bench: A Benchmark for Evaluating the Ability of Large Language Models to Generate Executable Algorithmic Trading Strategies

Alexey Khoroshilov, Alexey Chernysh, Orkhan Ekhtibarov, Nini Kamkia, Dmitry Zmitrovich

Lime

QuantCode-Bench evaluates modern LLMs on their ability to generate executable algorithmic trading strategies for the Backtrader framework from natural-language descriptions. The benchmark contains 400 tasks of varying difficulty collected from Reddit, TradingView, StackExchange, GitHub, and synthetic sources. Models are evaluated through a four-stage pipeline in both single-turn and agentic multi-turn settings.

Evaluation Pipeline

A strategy is counted as successful only if it passes all four stages sequentially.

1. Compilation: the generated code is syntactically correct; no interpreter errors.
2. Backtest: the strategy executes on historical data without runtime errors.
3. Trade: the strategy places at least one trade on the historical data.
4. Judge: an LLM judge confirms semantic alignment with the task description.
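The first stage can be sketched as a plain Python syntax and byte-compilation check. The function name and harness details below are illustrative assumptions, not the benchmark's actual code; the real pipeline also runs the later Backtest, Trade, and Judge stages.

```python
import ast

def check_compilation(source: str) -> bool:
    """Stage 1 sketch: the candidate strategy must parse and compile as Python."""
    try:
        ast.parse(source)                       # syntax check
        compile(source, "<strategy>", "exec")   # byte-compilation
        return True
    except SyntaxError:
        return False

# A trivially valid candidate passes; a broken one does not.
good = "class SmaCross:\n    pass\n"
bad = "class SmaCross(\n"   # unbalanced parenthesis
```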

Dataset

400 trading-strategy generation tasks from diverse sources and difficulty levels.

Source and Difficulty Distribution

Source         Easy  Medium  Hard  Total
Reddit          147      16    20    183
TradingView       6      57    37    100
StackExchange    32      34    24     90
GitHub           12       1     6     19
Synthetic         0       8     0      8
Total           197     116    87    400
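To make the task format concrete, here is a hypothetical sketch of what a single benchmark record might look like. The field names and the example description are assumptions for illustration, not the published schema.

```python
import json

# Hypothetical shape of one QuantCode-Bench task; keys are assumptions.
task = {
    "id": "reddit-0001",
    "source": "Reddit",
    "difficulty": "easy",
    "description": (
        "Go long when the 10-day SMA crosses above the 30-day SMA; "
        "close the position on the opposite cross."
    ),
}

record = json.dumps(task)  # serialized form, as it might appear on disk
```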

Single-Turn Results

The model must generate a correct strategy on the first attempt.

#    Model              Compilation %  Backtest %  Trade %  Judge %
1    claude-opus-4.6            100.0        98.2     77.2     75.8
2    gpt-5.4                    100.0        95.5     72.0     70.2
3    claude-sonnet-4.5          100.0        91.5     71.2     69.8
4    gpt-5.2-codex              100.0        94.5     74.5     67.5
5    glm-5                      100.0        92.4     70.3     65.4
6    claude-sonnet-4.6          100.0        85.8     66.2     65.0
7    kimi-k2.5                   99.7        87.3     67.5     64.8
8    gemini-3-flash             100.0        76.0     63.2     59.8
9    grok-4.1-fast               99.2        70.2     56.1     48.9
10   deepseek-v3.2              100.0        75.8     50.0     48.8
11   qwen3-235b                 100.0        72.5     49.0     48.2
12   qwen3-coder-30b            100.0        59.0     40.5     39.2
13   gemini-2.5-flash            99.5        49.2     33.2     31.2
14   qwen3-14b                   98.0        42.2     27.8     25.2
15   qwen3-8b                    99.5        31.8     19.8     18.5
16   qwen3-4b                    98.8        24.6     16.4     12.3
17   qwen3-1.7b                  98.1        23.1     13.7      7.8

[Figure: Judge Pass Rate (Single-Turn)]

Agentic Multi-Turn Results

The model receives structured error feedback after each failed attempt and may repair its strategy over up to 10 turns. The T1, T3, T5, and T10 columns report the cumulative judge pass rate by that turn; Avg. turns is the average number of turns used per task.
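The repair loop described above can be sketched as follows. Here `generate` and `run_pipeline` are hypothetical stand-ins for the LLM call and the four-stage evaluator, not the benchmark's actual API; the toy implementations only demonstrate the control flow.

```python
MAX_TURNS = 10  # turn budget from the benchmark description

def agentic_eval(task, generate, run_pipeline):
    """Return the first turn on which the strategy passed, or None."""
    feedback = None
    for turn in range(1, MAX_TURNS + 1):
        strategy = generate(task, feedback)    # model sees prior error report
        ok, feedback = run_pipeline(strategy)  # (passed?, structured feedback)
        if ok:
            return turn
    return None

# Toy stand-ins: the "model" succeeds once it has seen two error reports.
def toy_generate(task, feedback):
    return ("fixed" if feedback == "err2" else "draft", feedback)

def toy_pipeline(strategy):
    text, prev = strategy
    if text == "fixed":
        return True, None
    return False, "err2" if prev == "err1" else "err1"
```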

#    Model              Comp. %  Backtest %  Trade %  Judge %  Avg. turns    T1    T3    T5   T10
1    claude-opus-4.6      100.0       100.0    100.0     97.5         1.5  75.8  95.2  97.5  97.5
2    claude-sonnet-4.6    100.0        99.8     99.8     96.0         2.0  65.0  90.2  93.8  96.0
3    gpt-5.4              100.0        99.8     98.2     95.0         1.9  70.2  91.5  93.2  95.0
4    kimi-k2.5            100.0       100.0     98.8     93.5         2.3  64.8  84.2  89.2  93.5
5    claude-sonnet-4.5    100.0       100.0     99.5     93.0         2.0  69.8  90.0  91.2  93.0
6    gemini-3-flash       100.0        97.5     94.5     91.8         2.4  59.8  83.5  88.2  91.8
7    glm-5                100.0        99.5     95.8     90.8         2.4  65.4  83.2  88.2  90.8
8    gpt-5.2-codex        100.0       100.0     99.8     89.8         2.4  67.5  84.2  88.2  89.8
9    qwen3-235b           100.0        98.2     93.8     87.2         3.1  48.2  74.0  81.2  87.2
10   grok-4.1-fast        100.0        97.0     92.2     84.5         3.2  48.9  74.5  79.2  84.5
11   deepseek-v3.2        100.0        97.2     92.0     83.8         3.1  48.8  75.5  80.0  83.8
12   qwen3-coder-30b      100.0        86.2     76.0     68.0         4.7  39.2  57.0  61.8  68.0
13   qwen3-14b            100.0        83.2     67.0     62.7         5.3  25.2  51.0  56.0  62.7
14   gemini-2.5-flash     100.0        79.8     65.2     62.5         5.2  31.2  52.0  57.8  62.5
15   qwen3-8b             100.0        66.2     48.9     47.6         6.5  18.5  30.9  40.2  47.6
16   qwen3-4b              98.1        52.5     41.9     31.9         7.4  12.3  17.5  25.6  31.9
17   qwen3-1.7b           100.0        35.0     26.5     14.2         9.2   7.8   8.1   9.2  14.2
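The T1/T3/T5/T10 columns can be derived from per-task logs recording each task's first successful turn. A minimal sketch, assuming such logs are available (the function name and toy data are illustrative):

```python
def cumulative_pass_rate(first_pass_turns, turn):
    """Percentage of tasks solved by `turn`, given each task's first
    successful turn (None if never solved within the budget)."""
    solved = sum(1 for t in first_pass_turns if t is not None and t <= turn)
    return 100.0 * solved / len(first_pass_turns)

# Toy logs for five tasks: three solved on turn 1-3, one on turn 2, one never.
runs = [1, 1, 3, None, 2]
```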

[Figure: Cumulative Success by Turn (Top Models)]

Citation

@article{khoroshilov2026quantcodebench,
  title={QuantCode-Bench: A Benchmark for Evaluating the Ability of Large Language Models to Generate Executable Algorithmic Trading Strategies},
  author={Khoroshilov, Alexey and Chernysh, Alexey and Ekhtibarov, Orkhan and Kamkia, Nini and Zmitrovich, Dmitry},
  year={2026},
  url={https://github.com/LimexAILab/QuantCode-Bench}
}