Research framing. The cross-model GEL run matters because it stops the team from assuming that prompt coaching improves every model equally. It does not. In the internal March 25 experiment set, model size changed the result dramatically.
The three-model pattern
- Ultra-Small (1.5B): baseline 55.0%, coached 20.0%, delta -35.0pp
- Target SLM (4B): baseline 40.0%, coached 65.0%, delta +25.0pp
- Reference Coder (7B): baseline 45.0%, coached 70.0%, delta +25.0pp
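The deltas in the list above follow directly from the reported baseline and coached detection rates. A minimal sketch (the model labels and percentages are taken from this section; the data structure itself is illustrative):

```python
# Detection rates (%) per model: (baseline, coached), from the list above.
runs = {
    "Ultra-Small (1.5B)":   (55.0, 20.0),
    "Target SLM (4B)":      (40.0, 65.0),
    "Reference Coder (7B)": (45.0, 70.0),
}

for model, (baseline, coached) in runs.items():
    delta = coached - baseline  # percentage points; negative means coaching hurt
    print(f"{model}: {delta:+.1f}pp")
```

Running this reproduces the -35.0pp collapse at 1.5B and the matching +25.0pp gains at 4B and 7B.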
The pattern is sharp enough to be actionable. Coaching did not create a gentle slope of improvement. It broke the 1.5B model, then helped the 4B and 7B models by the same headline amount. That is the empirical backbone behind the paper's "prompt capacity threshold" argument.
What likely happened below 3B
On the 1.5B model, the coaching overhead appears to consume too much of the model's working budget. The model has to juggle meta-instructions and code-level inspection at the same time, and the result is degraded detection with only marginal false-positive relief. In other words, the prompt became cognitively expensive for a model that could not carry both tasks well.
Why 4B looks like the threshold zone
The 4B run is the first point where coaching seems to become a net amplifier instead of a distraction. The model improves by 25 percentage points while false positives also drop from 62.5% to 56.5%. That is not just more output; it is a move to a better quality frontier. Once the 4B model is paired with the right coaching density, the prompt becomes a scaffold instead of a burden.
What the 7B reference run adds
The 7B model posts the same +25pp improvement, but with a much larger false-positive reduction from 81.8% to 64.5%. That suggests the stronger model has enough headroom to absorb strategic framing and turn it into sharper filtering. It also hints that the "full protocol" version of coaching may only become worth it once the model has significantly more context discipline.
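The headline deltas alone understate the 7B result; the quality-frontier claim rests on looking at detection gain and false-positive reduction together. A sketch using the 4B and 7B numbers reported above (again, the dict layout is just illustrative):

```python
# Detection and false-positive rates (%), (baseline, coached),
# for the two runs where coaching helped.
metrics = {
    "Target SLM (4B)":      {"detection": (40.0, 65.0), "false_pos": (62.5, 56.5)},
    "Reference Coder (7B)": {"detection": (45.0, 70.0), "false_pos": (81.8, 64.5)},
}

for model, m in metrics.items():
    det_gain = m["detection"][1] - m["detection"][0]  # higher is better
    fp_drop = m["false_pos"][0] - m["false_pos"][1]   # higher is better
    print(f"{model}: +{det_gain:.1f}pp detection, -{fp_drop:.1f}pp false positives")
```

Both models post the same +25.0pp detection gain, but the 7B run's false-positive drop (about 17.3pp) is nearly three times the 4B run's 6.0pp, which is what the "headroom" argument points at.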
Editorial conclusion
The useful lesson is not "always use GEL." The useful lesson is "match prompt complexity to model capacity." A local audit workflow should pick one-step, three-step, or full coaching based on how much reasoning room the target model can actually sustain.
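The matching rule can be sketched as a simple tier selector. The parameter thresholds and the tier-to-size mapping below are illustrative readings of the three runs in this section, not values specified by the paper:

```python
def pick_coaching_tier(params_billion: float) -> str:
    """Pick a coaching density for a local audit model.

    Thresholds are illustrative, inferred from the three-model pattern:
    full coaching degraded the 1.5B run and became a net amplifier at 4B,
    with the 7B run absorbing heavier strategic framing.
    """
    if params_billion < 3.0:
        return "one-step"    # below the apparent capacity threshold: keep prompts minimal
    if params_billion < 7.0:
        return "three-step"  # threshold zone: coaching pays off, but keep density moderate
    return "full"            # enough headroom for the full protocol


for size in (1.5, 4.0, 7.0):
    print(f"{size}B -> {pick_coaching_tier(size)}")
```

The point of the sketch is the shape of the decision, not the exact cutoffs; a team adopting this would calibrate the thresholds against its own baseline-versus-coached runs.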