TrendHub
Insights Journal
Experiment Note · 6 min read · Original analysis

Prompt capacity thresholds: what the 1.5B, 4B and 7B runs actually say

The March scaling run shows that prompt coaching is not universally helpful: it cost the 1.5B model 35 percentage points while lifting both the 4B target model and the 7B reference coder by 25.

TrendHub Research Desk

Published March 25, 2026

Updated March 27, 2026

Format: Human-written editorial
Purpose: Original analysis over raw aggregation
Sources: 2 referenced materials

Research framing. The cross-model GEL run matters because it stops the team from assuming that prompt coaching improves every model equally. It does not. In the internal March 25 experiment set, model size changes the result dramatically.

The three-model pattern

  • Ultra-Small (1.5B): baseline 55.0%, coached 20.0%, delta -35.0pp
  • Target SLM (4B): baseline 40.0%, coached 65.0%, delta +25.0pp
  • Reference Coder (7B): baseline 45.0%, coached 70.0%, delta +25.0pp

The pattern is sharp enough to be actionable. Coaching did not create a gentle slope of improvement. It broke the 1.5B model, then helped the 4B and 7B models by the same headline amount. That is the empirical backbone behind the paper's "prompt capacity threshold" argument.
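
To make the pattern concrete, here is a minimal sketch of how those deltas could be recomputed from the experiment archive listed in the source trail. The file path comes from the source trail itself, but the JSON schema (per-model baseline and coached accuracy fields) is an assumption, so the sketch falls back to the figures transcribed above if the archive is not on disk.

    import json

    # Path taken from the source trail below; the schema is assumed,
    # not documented, so the transcribed figures serve as a fallback.
    RESULTS_PATH = "GEL_Paper/06_Results/gel_experiment_v2_results.json"

    # Figures transcribed from the three-model pattern above (percent).
    FALLBACK = {
        "Ultra-Small (1.5B)":   {"baseline": 55.0, "coached": 20.0},
        "Target SLM (4B)":      {"baseline": 40.0, "coached": 65.0},
        "Reference Coder (7B)": {"baseline": 45.0, "coached": 70.0},
    }

    def load_results(path):
        """Load the archive if present; otherwise use the transcribed figures."""
        try:
            with open(path) as f:
                return json.load(f)
        except FileNotFoundError:
            return FALLBACK

    for model, runs in load_results(RESULTS_PATH).items():
        delta = runs["coached"] - runs["baseline"]
        print(f"{model}: baseline {runs['baseline']:.1f}%, "
              f"coached {runs['coached']:.1f}%, delta {delta:+.1f}pp")

Printing the deltas side by side reproduces the cliff: -35.0pp at 1.5B, +25.0pp at both 4B and 7B.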

What likely happened below 3B

On the 1.5B model, the coaching overhead appears to consume too much of the model's working budget. The model has to juggle meta-instructions and code-level inspection at the same time, and the result is degraded detection with only marginal false-positive relief. In other words, the prompt became cognitively expensive for a model that could not carry both tasks well.

Why 4B looks like the threshold zone

The 4B run is the first point where coaching seems to become a net amplifier instead of a distraction. The model improves by 25 percentage points while false positives also drop from 62.5% to 56.5%. That is not just more output. It is a better quality frontier. Once the 4B model is paired with the right coaching density, the prompt becomes a scaffold instead of a burden.

What the 7B reference run adds

The 7B model posts the same +25pp improvement, but with a much larger false-positive reduction from 81.8% to 64.5%. That suggests the stronger model has enough headroom to absorb strategic framing and turn it into sharper filtering. It also hints that the "full protocol" version of coaching may only become worth it once the model has significantly more context discipline.
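
Lining up both axes makes the difference visible: the two larger models share the +25pp headline, but the 7B run cuts false positives nearly three times as hard. The figures below are the ones quoted in this note; treating (detection gain, false-positive reduction) as a two-axis quality frontier is this note's framing, not a metric from the paper.

    # Figures quoted above (percent): (baseline, coached) on each axis.
    runs = {
        "Target SLM (4B)":      {"detection": (40.0, 65.0), "false_pos": (62.5, 56.5)},
        "Reference Coder (7B)": {"detection": (45.0, 70.0), "false_pos": (81.8, 64.5)},
    }

    for model, r in runs.items():
        det_gain = r["detection"][1] - r["detection"][0]   # up is good
        fp_shift = r["false_pos"][1] - r["false_pos"][0]   # down is good
        print(f"{model}: detection {det_gain:+.1f}pp, false positives {fp_shift:+.1f}pp")

The output pairs +25.0pp with a -6.0pp false-positive shift at 4B, versus +25.0pp with -17.3pp at 7B.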

Editorial conclusion

The useful lesson is not "always use GEL." The useful lesson is "match prompt complexity to model capacity." A local audit workflow should pick one-step, three-step or full coaching based on how much reasoning room the target model can actually sustain.
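
As a closing sketch of that matching rule: the tier boundaries below are one reading of the three data points in this note, not calibrated thresholds from the paper. In particular, the 3B cut-off is inferred from the gap between the 1.5B failure and the 4B gain.

    def pick_coaching_tier(params_b):
        """Map model capacity (billions of parameters) to a coaching tier.

        Illustrative thresholds inferred from the 1.5B/4B/7B runs; below
        roughly 3B, coaching hurt outright, so even one-step coaching
        should be validated before use.
        """
        if params_b < 3.0:
            return "one-step (validate first; coaching cost the 1.5B run 35pp)"
        if params_b < 7.0:
            return "three-step"   # 4B threshold zone: coaching is a net win
        return "full"             # 7B+: headroom for the full protocol

    for size in (1.5, 4.0, 7.0):
        print(f"{size}B -> {pick_coaching_tier(size)}")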

Source trail

Referenced materials:

  • Scaling experiment archive: GEL_Paper/06_Results/gel_experiment_v2_results.json
  • Cross-model summary: Ultra-Small 1.5B, Target SLM 4B, Reference Coder 7B
