SWE-bench Verified Results

Benchmark results showing how PRAT-powered context changes coding-agent performance across models and price points.

On mini-swe-agent, XCE lifts Sonnet 4.0 from 66% to 73.4%, so the older-generation model beats baseline Sonnet 4.6 (72%). With the cascade hybrid, Sonnet 4.0 + XCE reaches 76.8%, matching Claude 4.5 Opus.

MiniMax M2.5 + XCE: 78.2% on SWE-bench Verified, beating Claude 4.5 Opus (76.8%) at a 16x lower per-token price.

Model Comparison

| Model | Config | Resolve Rate | Oracle Rate | Cost / Instance |
|---|---|---|---|---|
| Sonnet 4.0 (baseline) | mini-swe-agent | 66% | – | $1.50 |
| Sonnet 4.0 + XCE | XCE (Resolve@1) | 73.4% | 76.8% | $1.20 |
| Sonnet 4.6 (baseline) | mini-swe-agent | 72% | – | $3.00 |
| MiniMax M2.5 (baseline) | mini-swe-agent | 75.8% | – | $0.30 |
| MiniMax M2.5 + XCE | XCE (SWE-bench Verified) | 78.2% | – | $0.22 |
| Claude 4.5 Opus | Leaderboard | 76.8% | – | $8.50 |

With XCE, a model priced at $0.30 per million tokens outperforms one priced at $5 per million tokens.

MiniMax M2.5 + XCE achieves 78.2% — surpassing Claude 4.5 Opus at 76.8% on SWE-bench Verified.
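The "16x" figure appears to correspond to the ratio of the two quoted per-token prices rather than to the per-instance costs in the Model Comparison table. A quick check, using only numbers that appear on this page:

```python
# Per-token prices quoted above (same unit for both models).
minimax_token_price = 0.30
opus_token_price = 5.00

# Per-instance costs from the Model Comparison table.
minimax_xce_cost_per_instance = 0.22
opus_cost_per_instance = 8.50

# Ratio of list prices: this matches the "16x" headline (about 16.7).
token_price_ratio = opus_token_price / minimax_token_price

# Ratio of measured spend per instance, which is larger (about 38.6).
per_instance_ratio = opus_cost_per_instance / minimax_xce_cost_per_instance

print(f"per-token price ratio:   {token_price_ratio:.1f}x")
print(f"per-instance cost ratio: {per_instance_ratio:.1f}x")
```

The Opus row comes from the leaderboard configuration rather than mini-swe-agent, so the per-instance ratio compares different harnesses and should be read with that in mind.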

[Chart: Resolve Rate Comparison]

[Chart: Cost per Resolved Instance]
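Cost per resolved instance can be estimated from the Model Comparison table by dividing cost per instance by resolve rate. This is a sketch under one assumption: that "Cost / Instance" is averaged over every attempted instance, resolved or not. The page does not state how its chart computes the figure.

```python
# (resolve rate, cost per attempted instance in USD), from the Model Comparison table.
RESULTS = {
    "Sonnet 4.0 (baseline)": (0.66, 1.50),
    "Sonnet 4.0 + XCE": (0.734, 1.20),
    "Sonnet 4.6 (baseline)": (0.72, 3.00),
    "MiniMax M2.5 (baseline)": (0.758, 0.30),
    "MiniMax M2.5 + XCE": (0.782, 0.22),
    "Claude 4.5 Opus": (0.768, 8.50),
}


def cost_per_resolved(resolve_rate: float, cost_per_instance: float) -> float:
    """Expected spend per successfully resolved instance.

    Assumes cost_per_instance is averaged over all attempted instances.
    """
    if not 0 < resolve_rate <= 1:
        raise ValueError("resolve_rate must be in (0, 1]")
    return cost_per_instance / resolve_rate


# Print models from cheapest to most expensive per resolved instance.
ranked = sorted(RESULTS.items(), key=lambda item: cost_per_resolved(*item[1]))
for model, (rate, cost) in ranked:
    print(f"{model:<24} ${cost_per_resolved(rate, cost):.2f}")
```

Under this assumption, MiniMax M2.5 + XCE comes out at about $0.28 per resolved instance and Claude 4.5 Opus at about $11.07.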

XCE in Action: 8,427 Tool Calls Across 499 Instances

| Tool | Calls |
|---|---|
| xce_search | 1,677 |
| xce_callers | 1,608 |
| xce_callees | 1,612 |
| xce_architecture | 1,017 |
| xce_impact | 1,493 |
| xce_trace | 1,020 |

Avg steps (resolved): 67
Avg steps (unresolved): 81
Instances that used XCE: 100%
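The six per-tool counts should add up to the 8,427 total in the heading. A short script, using only the figures above, checks that total and derives calls per instance and each tool's share:

```python
# Per-tool call counts and instance count, copied from the stats above.
TOOL_CALLS = {
    "xce_search": 1677,
    "xce_callers": 1608,
    "xce_callees": 1612,
    "xce_architecture": 1017,
    "xce_impact": 1493,
    "xce_trace": 1020,
}
INSTANCES = 499

total = sum(TOOL_CALLS.values())
per_instance = total / INSTANCES
shares = {tool: count / total for tool, count in TOOL_CALLS.items()}

print(f"total calls: {total:,}")
print(f"calls per instance: {per_instance:.1f}")
# Most-used tools first.
for tool, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{tool:<17} {share:6.1%}")
```

That works out to roughly 17 XCE calls per instance, with xce_search the most used tool at about 20% of calls.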

Performance by Repository: Model Comparison

[Chart: resolve rates across repositories for different models. XCE-augmented models (blue/violet) consistently outperform their baselines.]

Repository Coverage: Radar View

[Chart: spider chart showing how XCE expands the performance envelope across all repositories. A larger area means better coverage.]

Performance by Repository

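| Repository | Resolved | Rate | Avg Steps |
|---|---|---|---|
| Django | 172/231 | 74.5% | 59 |
| SymPy | 51/75 | 68% | 75 |
| Sphinx | 25/44 | 56.8% | 74 |
| Matplotlib | 23/34 | 67.6% | 67 |
| scikit-learn | 21/32 | 65.6% | 72 |
| xarray | 20/22 | 90.9% | 162 |
| pytest | 14/19 | 73.7% | 91 |
| requests | 8/8 | 100% | 81 |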
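The Rate column is Resolved divided by the number of instances in that repository. Aggregating across rows can be done in two ways that give different answers; a sketch using only the table's figures:

```python
# (resolved, attempted) per repository, copied from the table above.
REPOS = {
    "Django": (172, 231),
    "SymPy": (51, 75),
    "Sphinx": (25, 44),
    "Matplotlib": (23, 34),
    "scikit-learn": (21, 32),
    "xarray": (20, 22),
    "pytest": (14, 19),
    "requests": (8, 8),
}

for repo, (resolved, attempted) in REPOS.items():
    print(f"{repo:<13} {resolved:>3}/{attempted:<3} {resolved / attempted:6.1%}")

# Pooled rate: every instance counts once, so large repos (Django) dominate.
total_resolved = sum(r for r, _ in REPOS.values())
total_attempted = sum(a for _, a in REPOS.values())
pooled_rate = total_resolved / total_attempted

# Unweighted mean: every repository counts once, regardless of size.
mean_rate = sum(r / a for r, a in REPOS.values()) / len(REPOS)

print(f"pooled:        {total_resolved}/{total_attempted} {pooled_rate:6.1%}")
print(f"mean of rates: {mean_rate:6.1%}")
```

For these eight repositories (465 instances), the pooled rate is about 71.8%, while the unweighted mean of per-repository rates is about 74.6%, because small repositories such as requests (8/8) and xarray (20/22) pull the unweighted mean up.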

Case Studies — How XCE Guided the Agent

Real examples where the baseline agent failed but the XCE-augmented agent resolved the issue, with the actual queries and context.

Django: django__django-13128

Temporal subtraction with mixed DateTimeField/DurationField output_field

72 steps, $0.18 cost

Without XCE

Failed — baseline agent couldn't locate the CombinedExpression class or understand how output_field resolution works for temporal operations.

With XCE

Resolved — XCE returned the exact CombinedExpression class and the output_field resolution chain.

Three targeted XCE searches progressively narrowed from DateTimeField to output_field resolution to CombinedExpression, giving the agent the full picture of how Django resolves types in arithmetic expressions.
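The type-resolution problem in this case can be illustrated with a small rule table. This is a hypothetical sketch, not Django's implementation: RULES and resolve_output_field are invented here, and the rules only mirror Python's own datetime arithmetic (a datetime minus a datetime gives a timedelta). It shows how a mixed expression, a DateTimeField subtraction combined with a DurationField, resolves once the subtraction is typed as a duration:

```python
from datetime import datetime, timedelta

# Hypothetical rule table: (lhs type, operator, rhs type) -> output type.
# Mirrors Python datetime semantics; NOT Django's actual resolution code.
RULES = {
    ("DateTimeField", "-", "DateTimeField"): "DurationField",
    ("DateTimeField", "+", "DurationField"): "DateTimeField",
    ("DateTimeField", "-", "DurationField"): "DateTimeField",
    ("DurationField", "+", "DurationField"): "DurationField",
    ("DurationField", "-", "DurationField"): "DurationField",
}


def resolve_output_field(lhs: str, op: str, rhs: str) -> str:
    """Return the output type for `lhs op rhs`, or fail like a mixed-type error."""
    try:
        return RULES[(lhs, op, rhs)]
    except KeyError:
        raise TypeError(f"mixed types {lhs} {op} {rhs}: set output_field") from None


# Python itself agrees: subtracting datetimes yields a timedelta.
start, end = datetime(2020, 1, 1, 9, 0), datetime(2020, 1, 1, 17, 30)
print(type(end - start).__name__, end - start)

# (end - start) + a duration: resolve the inner subtraction first, then the sum.
inner = resolve_output_field("DateTimeField", "-", "DateTimeField")
outer = resolve_output_field(inner, "+", "DurationField")
print(inner, outer)
```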

Django: django__django-16333

UserCreationForm save() doesn't call save_m2m()

61 steps, $0.09 cost

Without XCE

Failed — baseline agent found UserCreationForm but missed the save_m2m() call chain and the impact on related models.

With XCE

Resolved — XCE returned the UserCreationForm class and impact analysis showing 33 impacted nodes across 12 modules.

xce_search returned the UserCreationForm class definition. Then xce_impact on django/contrib/auth/forms.py revealed 33 impacted nodes across 12 modules, helping the agent understand the full blast radius before making the fix.
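The bug follows a general ModelForm pattern: with commit=False, Django's ModelForm.save() defers many-to-many saving and attaches a save_m2m() method for the caller to run later. An override that saves with commit=False and then writes the instance itself must also call save_m2m(). A dependency-free sketch with hypothetical stand-in classes (SketchModelForm and the two form classes are invented here, not Django's code):

```python
class SketchModelForm:
    """Hypothetical stand-in for ModelForm's commit handling."""

    def __init__(self):
        self.instance = {"saved": False}
        self.m2m_written = False

    def _save_m2m(self):
        self.m2m_written = True  # stands in for writing many-to-many rows

    def save(self, commit=True):
        if commit:
            self.instance["saved"] = True
            self._save_m2m()
        else:
            # Defer m2m saving; the caller must call save_m2m() later.
            self.save_m2m = self._save_m2m
        return self.instance


class BuggyUserCreationForm(SketchModelForm):
    def save(self, commit=True):
        user = super().save(commit=False)
        user["password"] = "<hashed>"
        if commit:
            user["saved"] = True  # m2m data is silently dropped
        return user


class FixedUserCreationForm(SketchModelForm):
    def save(self, commit=True):
        user = super().save(commit=False)
        user["password"] = "<hashed>"
        if commit:
            user["saved"] = True
            if hasattr(self, "save_m2m"):
                self.save_m2m()  # finish the deferred m2m write
        return user


for form in (BuggyUserCreationForm(), FixedUserCreationForm()):
    form.save()
    print(type(form).__name__, "m2m written:", form.m2m_written)
```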

Django: django__django-13741

ReadOnlyPasswordHashField disabled attribute issue

40 steps, $0.06 cost

Without XCE

Failed — baseline agent searched broadly for password hash handling, wasting steps on unrelated auth code.

With XCE

Resolved in 40 steps at $0.06 — XCE returned the exact class and impact analysis.

A single xce_search for "ReadOnlyPasswordHashField" returned the exact class definition, including the __init__ method where the missing disabled=True default needed to go. xce_impact then confirmed 33 impacted nodes, giving confidence that the fix was safe.
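Django's documentation describes the disabled argument: a disabled field is not editable, and any submitted value is ignored in favor of the form's initial data. Defaulting it in the field's constructor is one way to make a read-only field behave like that. A minimal sketch with a hypothetical SketchField stand-in (its clean_value method is invented for illustration and is not Django's API):

```python
class SketchField:
    """Hypothetical stand-in for a form field's constructor options."""

    def __init__(self, *, required=True, disabled=False, initial=None):
        self.required = required
        self.disabled = disabled
        self.initial = initial

    def clean_value(self, submitted):
        # Disabled fields ignore the submitted value and keep the initial one.
        return self.initial if self.disabled else submitted


class ReadOnlyPasswordHashField(SketchField):
    def __init__(self, **kwargs):
        kwargs.setdefault("required", False)
        kwargs.setdefault("disabled", True)  # read-only unless caller overrides
        super().__init__(**kwargs)


field = ReadOnlyPasswordHashField(initial="stored-hash")
print(field.disabled, field.clean_value("tampered"))
```

Using setdefault keeps the new default overridable, so a caller that explicitly passes disabled=False still gets an editable field.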

Django: django__django-12143

Admin changelist _get_edited_object_pks regex prefix issue

39 steps, $0.06 cost

Without XCE

Failed — baseline agent found the admin options file but couldn't locate the specific regex pattern causing the issue.

With XCE

Resolved in 39 steps at $0.06 — XCE returned the exact function with the regex pattern.

xce_search for "_get_edited_object_pks admin formset prefix" returned the exact function containing the regex pattern that needed fixing. The agent immediately saw that the formset prefix was interpolated into the regex without re.escape(prefix), and fixed it.
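Interpolating a formset prefix straight into a regex changes the pattern's meaning when the prefix contains metacharacters such as +. The sketch below is modeled on the pattern described in this case study rather than copied from Django; edited_pk_pattern and the POST keys are hypothetical:

```python
import re


def edited_pk_pattern(prefix: str, pk_name: str, escape: bool) -> "re.Pattern[str]":
    """Compile a matcher for POST keys shaped like '<prefix>-<index>-<pk_name>'."""
    # Without re.escape, metacharacters in the prefix change the regex meaning.
    safe_prefix = re.escape(prefix) if escape else prefix
    return re.compile(r"{}-\d+-{}$".format(safe_prefix, re.escape(pk_name)))


# Hypothetical keys: two from the "my+form" formset, one from another formset.
post_keys = ["my+form-0-id", "my+form-1-id", "myyform-0-id"]
for escape in (False, True):
    pattern = edited_pk_pattern("my+form", "id", escape)
    print(f"escape={escape}:", [key for key in post_keys if pattern.match(key)])
```

Unescaped, "my+form" reads as "m, one or more y, form", so the real keys are missed and an unrelated key matches; escaped, only the intended keys match.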

Django: django__django-14855

get_admin_url for readonly ForeignKey with custom admin site

70 steps, $0.21 cost

Without XCE

Failed — baseline agent found the admin helpers but couldn't trace the URL generation chain for custom admin sites.

With XCE

Resolved — XCE returned the get_admin_url function showing the hardcoded "admin:" prefix that needed to use the custom site name.

xce_search returned the get_admin_url function in django/contrib/admin/helpers.py, clearly showing url_name = "admin:%s_%s_change" — the hardcoded "admin:" prefix was the bug. The agent saw it immediately and replaced it with the dynamic admin site name.
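Admin change-view URL names follow the shape "<namespace>:<app_label>_<model_name>_change", and a custom AdminSite instance is registered under its own name as an instance namespace. A hypothetical helper (change_url_name is invented here) shows why hardcoding "admin:" always targets the default site. Django's reverse() also accepts a current_app argument for choosing an instance namespace, which is another way to reach the custom site.

```python
def change_url_name(app_label: str, model_name: str, site_name: str = "admin") -> str:
    """Build an admin change-view URL name (hypothetical helper).

    site_name="admin" reproduces the hardcoded prefix; passing the custom
    AdminSite's name targets that site's namespace instead.
    """
    return "%s:%s_%s_change" % (site_name, app_label, model_name)


print(change_url_name("library", "book"))
print(change_url_name("library", "book", site_name="custom_admin"))
```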

Django: django__django-11880

Form Field __deepcopy__ shares error_messages between instances

57 steps, $0.08 cost

Without XCE

Failed — baseline agent found __deepcopy__ in widgets.py but missed the error_messages sharing issue in the Field base class.

With XCE

Resolved — XCE returned the __deepcopy__ method with a full analysis explaining the mutable dictionary sharing bug.

xce_search returned the __deepcopy__ method together with an analysis of the mutable-dictionary sharing: because __deepcopy__ starts from a shallow copy, the copied field's error_messages still references the original field's dictionary, so error messages modified on one form instance leak into the others.
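The sharing comes from starting __deepcopy__ with copy.copy(): the new object's attributes still point at the same values, so a mutable dict such as error_messages is shared. A self-contained sketch with hypothetical, simplified stand-in classes (not Django's Field):

```python
import copy


class SketchField:
    """Hypothetical stand-in for a form field with per-instance error messages."""

    def __init__(self):
        self.error_messages = {"required": "This field is required."}

    def __deepcopy__(self, memo):
        # Buggy pattern: a shallow copy shares the same error_messages dict.
        result = copy.copy(self)
        memo[id(self)] = result
        return result


class FixedSketchField(SketchField):
    def __deepcopy__(self, memo):
        result = super().__deepcopy__(memo)
        # Give the copy its own dictionary so edits don't leak back.
        result.error_messages = self.error_messages.copy()
        return result


for cls in (SketchField, FixedSketchField):
    original = cls()
    clone = copy.deepcopy(original)
    clone.error_messages["required"] = "Please fill this in."
    print(cls.__name__, "original now says:", original.error_messages["required"])
```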

Django: django__django-12858

models.E015 raised when ordering uses lookups that are not transforms

61 steps, $0.12 cost

Without XCE

Failed — baseline agent found _check_ordering but couldn't understand the full lookup resolution chain for transforms.

With XCE

Resolved — XCE returned the complete _check_ordering function with the full field traversal logic.

xce_search returned the entire _check_ordering function showing how Django validates ordering fields, including LOOKUP_SEP splitting, related field traversal, and the transform check. The agent could see exactly where the check fell short: a trailing part was accepted only if it was a transform, so plain lookups were rejected.
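The shape of this check can be sketched with simplified stand-ins: a trailing ordering part that is not itself a field is accepted only if the last resolved field recognizes it. The sketch contrasts a predicate that accepts only transforms with one that also accepts lookups such as isnull. SketchModelField and both predicate names are hypothetical; this illustrates the logic rather than reproducing Django's code.

```python
class SketchModelField:
    """Hypothetical stand-in for a model field's registered transforms/lookups."""

    def __init__(self, transforms=(), lookups=()):
        self._transforms = set(transforms)
        self._lookups = set(lookups)

    def get_transform(self, name):
        return name if name in self._transforms else None

    def get_lookup(self, name):
        return name if name in self._lookups else None


def flags_e015_transforms_only(fld, part):
    # Rejects any trailing part that is not a registered transform.
    return fld is None or fld.get_transform(part) is None


def flags_e015_transforms_or_lookups(fld, part):
    # Rejects a trailing part only if it is neither a transform nor a lookup.
    return fld is None or (fld.get_transform(part) is None and fld.get_lookup(part) is None)


parent = SketchModelField(transforms=(), lookups=("isnull", "exact"))
for part in ("isnull", "nonexistent"):
    print(part, flags_e015_transforms_only(parent, part),
          flags_e015_transforms_or_lookups(parent, part))
```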

Raw Data

Full transparency — trajectory and prediction data available for download.

Ready to supercharge your coding agents?

Give your agents the context they need to solve real-world issues.