SWE-bench Verified Results
Real benchmark data showing how PRAT-powered context transforms coding agent performance — across models and price points.
All runs use mini-swe-agent. Sonnet 4.0 + XCE improves from 66% to 73.4%, so an older-generation model beats raw Sonnet 4.6 (72%), and it reaches the Opus-level 76.8% with the cascade hybrid.
MiniMax M2.5 + XCE: 78.2% on SWE-bench Verified — beating Claude Opus at 76.8%, at 16x lower cost
Model Comparison
| Model | Config | Resolve Rate | Oracle Rate | Cost / Instance |
|---|---|---|---|---|
| Sonnet 4.0 (baseline) | mini-swe-agent | 66% | — | $1.50 |
| Sonnet 4.0 + XCE | Resolve@1 | 73.4% | 76.8% | $1.20 |
| Sonnet 4.6 (baseline) | mini-swe-agent | 72% | — | $3.00 |
| MiniMax M2.5 (baseline) | mini-swe-agent | 75.8% | — | $0.30 |
| MiniMax M2.5 + XCE | SWE-bench Verified | 78.2% | — | $0.22 |
| Claude 4.5 Opus | Leaderboard | 76.8% | — | $8.50 |
With XCE, a $0.30/1K token model outperforms a $5/1K token model
MiniMax M2.5 + XCE achieves 78.2% — surpassing Claude 4.5 Opus at 76.8% on SWE-bench Verified.
[Chart: Resolve Rate Comparison]
[Chart: Cost per Resolved Instance]
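As a rough check on the cost comparison, cost per resolved instance can be estimated from the Model Comparison table by dividing cost per instance by resolve rate. This is a minimal sketch: it assumes the "Cost / Instance" column is an average over all attempted instances, which the page does not state, and the function name is illustrative.

```python
# Figures copied from the Model Comparison table: (resolve rate, cost per instance in USD).
MODELS = {
    "Sonnet 4.0 (baseline)": (0.66, 1.50),
    "Sonnet 4.0 + XCE": (0.734, 1.20),
    "Sonnet 4.6 (baseline)": (0.72, 3.00),
    "MiniMax M2.5 (baseline)": (0.758, 0.30),
    "MiniMax M2.5 + XCE": (0.782, 0.22),
    "Claude 4.5 Opus": (0.768, 8.50),
}


def cost_per_resolved(resolve_rate, cost_per_instance):
    """Estimated spend per resolved instance.

    Assumption: cost_per_instance is averaged over all attempted instances,
    so dividing by the resolve rate spreads the total spend over the successes.
    """
    return cost_per_instance / resolve_rate


for name, (rate, cost) in MODELS.items():
    print(f"{name:26s} ${cost_per_resolved(rate, cost):.2f} per resolved instance")
```

Under that assumption, a cheaper per-instance model with a higher resolve rate improves on both terms of the ratio at once.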
XCE in Action — 8,427 Tool Calls Across 499 Instances
| Tool | Calls |
|---|---|
| xce_search | 1,677 |
| xce_callers | 1,608 |
| xce_callees | 1,612 |
| xce_architecture | 1,017 |
| xce_impact | 1,493 |
| xce_trace | 1,020 |

Avg steps (resolved): 67
Avg steps (unresolved): 81
Instances used XCE: 100%
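The per-tool counts above can be tallied to confirm the headline total and to see how usage is spread across tools. The sketch below uses only the numbers reported in this section; the variable names are illustrative.

```python
# Per-tool call counts as reported in this section.
TOOL_CALLS = {
    "xce_search": 1677,
    "xce_callers": 1608,
    "xce_callees": 1612,
    "xce_architecture": 1017,
    "xce_impact": 1493,
    "xce_trace": 1020,
}
INSTANCES = 499  # instance count from the section heading

total_calls = sum(TOOL_CALLS.values())
calls_per_instance = total_calls / INSTANCES

print(f"total calls: {total_calls}")
print(f"average calls per instance: {calls_per_instance:.1f}")
for tool, count in sorted(TOOL_CALLS.items(), key=lambda item: -item[1]):
    print(f"{tool:18s} {count / total_calls:.1%}")
```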
Performance by Repository — Model Comparison
Resolve rates across repositories for different models. XCE-augmented models (blue/violet) consistently outperform their baselines.
Repository Coverage — Radar View
Spider chart showing how XCE expands the performance envelope across all repositories. Larger area = better coverage.
Performance by Repository
| Repository | Resolved | Rate | Avg Steps |
|---|---|---|---|
| Django | 172/231 | 74.5% | 59 |
| SymPy | 51/75 | 68% | 75 |
| Sphinx | 25/44 | 56.8% | 74 |
| Matplotlib | 23/34 | 67.6% | 67 |
| scikit-learn | 21/32 | 65.6% | 72 |
| xarray | 20/22 | 90.9% | 162 |
| pytest | 14/19 | 73.7% | 91 |
| requests | 8/8 | 100% | 81 |
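The Rate column follows directly from the Resolved column, and the same counts give an aggregate over the repositories listed. A small sketch follows. Note that the eight rows shown cover 465 instances, so the aggregate here describes only these rows, not the full benchmark.

```python
# (resolved, attempted) pairs copied from the table above.
REPOS = {
    "Django": (172, 231),
    "SymPy": (51, 75),
    "Sphinx": (25, 44),
    "Matplotlib": (23, 34),
    "scikit-learn": (21, 32),
    "xarray": (20, 22),
    "pytest": (14, 19),
    "requests": (8, 8),
}


def resolve_rate(resolved, attempted):
    """Resolve rate as a percentage."""
    return 100.0 * resolved / attempted


for repo, (resolved, attempted) in REPOS.items():
    print(f"{repo:13s} {resolve_rate(resolved, attempted):5.1f}%")

total_resolved = sum(r for r, _ in REPOS.values())
total_attempted = sum(a for _, a in REPOS.values())
print(f"listed repos: {total_resolved}/{total_attempted} "
      f"= {resolve_rate(total_resolved, total_attempted):.1f}%")
```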
Case Studies — How XCE Guided the Agent
Real examples where the baseline agent failed but the XCE-augmented agent resolved the issue, with the actual queries and context.
Temporal subtraction with mixed DateTimeField/DurationField output_field
72 steps, $0.18 cost
Without XCE: Failed. The baseline agent couldn't locate the CombinedExpression class or understand how output_field resolution works for temporal operations.
With XCE: Resolved. XCE returned the exact CombinedExpression class and the output_field resolution chain.
Three targeted XCE searches progressively narrowed from DateTimeField to output_field resolution to CombinedExpression, giving the agent the full picture of how Django resolves types in arithmetic expressions.
UserCreationForm save() doesn't call save_m2m()
61 steps, $0.09 cost
Without XCE: Failed. The baseline agent found UserCreationForm but missed the save_m2m() call chain and the impact on related models.
With XCE: Resolved. XCE returned the UserCreationForm class and an impact analysis showing 33 impacted nodes across 12 modules.
xce_search returned the UserCreationForm class definition. Then xce_impact on django/contrib/auth/forms.py revealed 33 impacted nodes across 12 modules, helping the agent understand the full blast radius before making the fix.
ReadOnlyPasswordHashField disabled attribute issue
40 steps, $0.06 cost
Without XCE: Failed. The baseline agent searched broadly for password hash handling, wasting steps on unrelated auth code.
With XCE: Resolved. XCE returned the exact class and an impact analysis.
A single xce_search for "ReadOnlyPasswordHashField" returned the exact class definition with its __init__ method showing the disabled=True default. xce_impact then confirmed 33 impacted nodes, giving confidence the fix was safe.
Admin changelist _get_edited_object_pks regex prefix issue
39 steps, $0.06 cost
Without XCE: Failed. The baseline agent found the admin options file but couldn't locate the specific regex pattern causing the issue.
With XCE: Resolved. XCE returned the exact function with the regex pattern.
xce_search for "_get_edited_object_pks admin formset prefix" returned the exact function with the regex pattern that needed fixing. The agent immediately saw the re.escape(prefix) issue and fixed it.
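To make the re.escape(prefix) issue concrete, here is a minimal sketch of why interpolating an unescaped prefix into a regular expression is a problem. The pattern shape and helper names are assumptions modeled on the description above, not a copy of Django's source, and the example prefix is hypothetical.

```python
import re


def pk_pattern_unescaped(prefix, pk_name):
    # Buggy variant: regex metacharacters in the prefix are interpreted as regex syntax.
    return re.compile(r"{}-\d+-{}$".format(prefix, pk_name))


def pk_pattern_escaped(prefix, pk_name):
    # Fixed variant: re.escape makes the prefix match literally.
    return re.compile(r"{}-\d+-{}$".format(re.escape(prefix), pk_name))


prefix = "form.set"  # hypothetical formset prefix containing a regex metacharacter
keys = ["form.set-0-id", "formXset-0-id"]

matched_unescaped = [k for k in keys if pk_pattern_unescaped(prefix, "id").match(k)]
matched_escaped = [k for k in keys if pk_pattern_escaped(prefix, "id").match(k)]

print(matched_unescaped)  # both keys: "." matches any character
print(matched_escaped)    # only the key with the literal prefix
```

A prefix containing characters such as `.` or `[` would either match unintended keys or fail to compile, which is why escaping is the safer default whenever user-controlled text is placed inside a pattern.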
get_admin_url for readonly ForeignKey with custom admin site
70 steps, $0.21 cost
Without XCE: Failed. The baseline agent found the admin helpers but couldn't trace the URL generation chain for custom admin sites.
With XCE: Resolved. XCE returned the get_admin_url function showing the hardcoded "admin:" prefix that needed to use the custom site name.
xce_search returned the get_admin_url function in django/contrib/admin/helpers.py, clearly showing url_name = "admin:%s_%s_change" — the hardcoded "admin:" prefix was the bug. The agent saw it immediately and replaced it with the dynamic admin site name.
Form Field __deepcopy__ shares error_messages between instances
57 steps, $0.08 cost
Without XCE: Failed. The baseline agent found __deepcopy__ in widgets.py but missed the error_messages sharing issue in the Field base class.
With XCE: Resolved. XCE returned the __deepcopy__ method with a full analysis explaining the mutable dictionary sharing bug.
xce_search returned the __deepcopy__ method along with a detailed analysis explaining how obj.attrs references the same dictionary as self.widget.attrs, causing error_messages to be shared between form instances.
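The class of bug described here, a custom __deepcopy__ that leaves a mutable dictionary shared between copies, can be reproduced in a few lines. The classes below are simplified stand-ins written for illustration; they are not Django's Field implementation, and the fix shown is one plausible approach rather than the actual patch.

```python
import copy


class Field:
    """Simplified stand-in for a form field (not Django's class)."""

    def __init__(self):
        self.error_messages = {"required": "This field is required."}

    def __deepcopy__(self, memo):
        # Only a shallow copy: the error_messages dict is still shared.
        result = copy.copy(self)
        memo[id(self)] = result
        return result


class FixedField(Field):
    def __deepcopy__(self, memo):
        result = super().__deepcopy__(memo)
        # Give each copy its own dictionary.
        result.error_messages = self.error_messages.copy()
        return result


original = Field()
clone = copy.deepcopy(original)
clone.error_messages["required"] = "Changed on the clone"
shared = original.error_messages is clone.error_messages

fixed_original = FixedField()
fixed_clone = copy.deepcopy(fixed_original)
fixed_clone.error_messages["required"] = "Changed on the clone"
fixed_shared = fixed_original.error_messages is fixed_clone.error_messages

print(shared, original.error_messages["required"])
print(fixed_shared, fixed_original.error_messages["required"])
```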
models.E015 ordering lookup check incorrectly handles transforms
61 steps, $0.12 cost
Without XCE: Failed. The baseline agent found _check_ordering but couldn't understand the full lookup resolution chain for transforms.
With XCE: Resolved. XCE returned the complete _check_ordering function with the full field traversal logic.
xce_search returned the entire _check_ordering function showing how Django validates ordering fields — including LOOKUP_SEP splitting, related field traversal, and the transform check. The agent could see exactly where the transform handling was missing.
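The traversal described above (split on LOOKUP_SEP, follow related fields, then decide whether a trailing segment is an acceptable transform) can be sketched without Django. The schema, transform names, and function below are hypothetical and only illustrate the shape of the check; the allow_transforms flag toggles between the behavior that rejects transforms and the behavior that accepts them.

```python
LOOKUP_SEP = "__"  # the separator Django uses between lookup path segments

# Hypothetical toy schema: model -> {field name: related model, or None for a plain field}
SCHEMA = {
    "Book": {"title": None, "author": "Author"},
    "Author": {"name": None},
}
TRANSFORMS = {"lower", "upper"}  # illustrative transform names


def check_ordering(model, ordering, allow_transforms=True):
    """Return the ordering entries that cannot be resolved against SCHEMA."""
    unresolved = []
    for entry in ordering:
        current = model
        for part in entry.lstrip("-").split(LOOKUP_SEP):
            fields = SCHEMA.get(current, {})
            if part in fields:
                # Follow a relation, or land on a plain field (current becomes None).
                current = fields[part]
            elif allow_transforms and current is None and part in TRANSFORMS:
                # A transform applied after a plain field is accepted.
                current = None
            else:
                unresolved.append(entry)
                break
    return unresolved


without_transforms = check_ordering("Book", ["-author__name__lower"], allow_transforms=False)
with_transforms = check_ordering("Book", ["-author__name__lower"], allow_transforms=True)
print(without_transforms)
print(with_transforms)
```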
Raw Data
Full transparency — trajectory and prediction data available for download.