Research framing. The cross-model GEL run matters because it stops the team from assuming that prompt coaching improves every model equally. It does not. In the internal March 25 experiment set, model size changed the result dramatically.
The three-model pattern
- Ultra-Small (1.5B): baseline 55.0%, coached 20.0%, delta -35.0pp
- Target SLM (4B): baseline 40.0%, coached 65.0%, delta +25.0pp
- Reference Coder (7B): baseline 45.0%, coached 70.0%, delta +25.0pp
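The deltas in the list above follow directly from the reported baseline and coached detection rates. A minimal sketch (the model labels and percentages are taken from this section; the data structure itself is illustrative):

```python
# Detection rates (%) per model: (baseline, coached), from the list above.
runs = {
    "Ultra-Small (1.5B)":   (55.0, 20.0),
    "Target SLM (4B)":      (40.0, 65.0),
    "Reference Coder (7B)": (45.0, 70.0),
}

for model, (baseline, coached) in runs.items():
    delta = coached - baseline  # percentage points; negative means coaching hurt
    print(f"{model}: {delta:+.1f}pp")
```

Running this reproduces the -35.0pp collapse at 1.5B and the matching +25.0pp gains at 4B and 7B.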
The pattern is sharp enough to be actionable. Coaching did not create a gentle slope of improvement. It broke the 1.5B model, then helped the 4B and 7B models by the same headline amount. That is the empirical backbone behind the paper's "prompt capacity threshold" argument.
What likely happened below 3B
On the 1.5B model, the coaching overhead appears to consume too much of the model's working budget. The model has to juggle meta-instructions and code-level inspection at the same time, and the result is degraded detection with only marginal false-positive relief. In other words, the prompt became cognitively expensive for a model that could not carry both tasks well.
Why 4B looks like the threshold zone
The 4B run is the first point where coaching seems to become a net amplifier instead of a distraction. The model improves by 25 percentage points while false positives also drop from 62.5% to 56.5%. That is not just more output; it is a move to a better quality frontier. Once the 4B model is paired with the right coaching density, the prompt becomes a scaffold instead of a burden.
What the 7B reference run adds
The 7B model posts the same +25pp improvement, but with a much larger false-positive reduction from 81.8% to 64.5%. That suggests the stronger model has enough headroom to absorb strategic framing and turn it into sharper filtering. It also hints that the "full protocol" version of coaching may only become worth it once the model has significantly more context discipline.
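The headline deltas alone understate the 7B result; the quality-frontier claim rests on looking at detection gain and false-positive reduction together. A sketch using the 4B and 7B numbers reported above (again, the dict layout is just illustrative):

```python
# Detection and false-positive rates (%), (baseline, coached),
# for the two runs where coaching helped.
metrics = {
    "Target SLM (4B)":      {"detection": (40.0, 65.0), "false_pos": (62.5, 56.5)},
    "Reference Coder (7B)": {"detection": (45.0, 70.0), "false_pos": (81.8, 64.5)},
}

for model, m in metrics.items():
    det_gain = m["detection"][1] - m["detection"][0]  # higher is better
    fp_drop = m["false_pos"][0] - m["false_pos"][1]   # higher is better
    print(f"{model}: +{det_gain:.1f}pp detection, -{fp_drop:.1f}pp false positives")
```

Both models post the same +25.0pp detection gain, but the 7B run's false-positive drop (about 17.3pp) is nearly three times the 4B run's 6.0pp, which is what the "headroom" argument points at.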
Editorial conclusion
The useful lesson is not "always use GEL." The useful lesson is "match prompt complexity to model capacity." A local audit workflow should pick one-step, three-step, or full coaching based on how much reasoning room the target model can actually sustain.
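The matching rule can be sketched as a simple tier selector. The parameter thresholds and the tier-to-size mapping below are illustrative readings of the three runs in this section, not values specified by the paper:

```python
def pick_coaching_tier(params_billion: float) -> str:
    """Pick a coaching density for a local audit model.

    Thresholds are illustrative, inferred from the three-model pattern:
    full coaching degraded the 1.5B run and became a net amplifier at 4B,
    with the 7B run absorbing heavier strategic framing.
    """
    if params_billion < 3.0:
        return "one-step"    # below the apparent capacity threshold: keep prompts minimal
    if params_billion < 7.0:
        return "three-step"  # threshold zone: coaching pays off, but keep density moderate
    return "full"            # enough headroom for the full protocol


for size in (1.5, 4.0, 7.0):
    print(f"{size}B -> {pick_coaching_tier(size)}")
```

The point of the sketch is the shape of the decision, not the exact cutoffs; a team adopting this would calibrate the thresholds against its own baseline-versus-coached runs.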