TrendHub
Insights Journal
Research Brief · 7 min read · Original analysis

GEL Ablation Study: Why strategic coaching doubled the 4B audit yield

An internal ablation study shows that generalized coaching, even without direct bug hints, pushed a 4B code-audit model from 35.0% to 72.5% mean issue detection with strong statistical support.


Alex Yoon

Published March 25, 2026

Updated March 27, 2026

Format: Human-written editorial

Purpose: Original analysis over raw aggregation

Sources: 2 referenced materials

TrendHub note. This article is based on the internal GEL paper draft and the ablation result set archived in the HQ research workspace. The goal is not to restate the paper line by line, but to explain why the result matters for people trying to run useful code-audit workflows on smaller local models.

What the ablation actually tested

The internal study isolates three prompt conditions on the same 4B model: a plain baseline audit prompt, a coached prompt with direct issue hints, and a coached prompt that keeps the reflective structure but removes explicit bug-category leakage. That distinction matters because a strong result is only useful if it comes from better reasoning, not from accidentally feeding the answer key back into the model.

  • Baseline mean detection rate: 35.0%
  • Coached-General mean detection rate: 72.5%
  • Coached-Specific mean detection rate: 85.0%
  • Statistical significance for the generalized coaching arm: p = 0.0015
  • Effect size for generalized coaching vs. baseline: Cohen's d = 2.21

Why the generalized coaching result is the real headline

The most important number is not the 85.0% peak score under the hint-heavy condition. It is the jump from 35.0% to 72.5% when the prompt keeps the strategic frame but strips out answer leakage. That result says the model is not merely parroting categories it was told to inspect. It is doing meaningfully better work once the prompt forces it to think about robustness, maintainability and edge-case pressure before scanning the code.

In practical terms, that doubles the yield of a 4B audit pass without moving to a much more expensive hosted model. For teams that want to keep review loops local, cheap and frequent, that is the kind of delta that changes whether a workflow feels viable or merely interesting.

What the leakage gap still tells us

The distance between Coached-General and Coached-Specific is still useful. It quantifies the remaining advantage of direct hinting and gives the team a way to separate metacognitive lift from prompt contamination. The internal result set reports a 0.74 effect-size gap for that leakage component, which is material, but it is much smaller than the generalized coaching effect itself. That is why the study can still defend the central claim.
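One simple way to read that decomposition, assuming the 0.74 gap is the difference between the two arms' effect sizes against baseline (an interpretation of the internal numbers, not a formula stated in the paper), is as plain arithmetic:

```python
def leakage_component(d_specific, d_general):
    """Effect-size gap attributable to direct hinting, read as a simple
    difference between the two coached arms' effects vs. baseline."""
    return d_specific - d_general

# The article reports d = 2.21 for Coached-General vs. baseline and a 0.74
# leakage gap; under this reading, the implied Coached-Specific effect
# would be 2.21 + 0.74 = 2.95.
print(round(leakage_component(2.95, 2.21), 2))
```

The point of separating the two terms is exactly what the study argues: the metacognitive lift (2.21) dwarfs the hinting lift (0.74), so the central claim survives the leakage critique.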

Operator takeaway

If you are running a 3B to 7B class model for code review, this is the evidence that a small amount of structured pre-audit coaching can be worth more than adding another layer of generic prompt ornamentation. The useful pattern is to prime the model with analytical posture, not to bury it under a long checklist of named failure classes.
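To make "analytical posture, not a checklist" concrete, here is an illustrative preamble in the spirit of the Coached-General condition. The wording is entirely an assumption of this article, not the paper's actual prompt; the property that matters is that it forces planning without naming any bug category:

```python
# Hypothetical coaching preamble: sets a reflective, strategic frame
# before the audit, without leaking specific failure classes.
COACHING_PREAMBLE = """\
Before reading the code, plan your audit:
1. What are this module's robustness and maintainability risks?
2. Which inputs or states would put the code under edge-case pressure?
3. What would a careful reviewer check first, and why?
Write your plan, then audit the code against it.
"""

def build_audit_prompt(code: str) -> str:
    """Prepend the strategic frame to the code under review."""
    return f"{COACHING_PREAMBLE}\n--- CODE UNDER AUDIT ---\n{code}"

print(build_audit_prompt("def parse(x): return int(x)"))
```

The contrast with a Coached-Specific prompt would be a preamble that instead enumerated named failure classes, which is precisely the leakage the ablation controls for.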

Source trail

Referenced materials

Internal paper draft

Strategic Coaching as a Performance Amplifier for Small Language Models in Code Auditing: An Ablation Study

Ablation result set

GEL_Paper/06_Results/ablation_results.json
