While building an AI-driven side project, I ran into a design tradeoff that kept resurfacing with each LLM-powered feature:
Should a single prompt handle an entire workflow, or should each step be its own call?
The Single-Prompt Approach
In the initial version of one feature, the system did everything in one shot. A single LLM prompt would:
- Categorize an item from user context
- Decide whether there was enough information to score it
- Generate, all at once, every follow-up question it thought was necessary when context was missing
That approach worked initially, but it had a clear limitation: the follow-up questions were often repetitive or poorly targeted, because the model couldn’t incorporate context from previous answers. Once questions were generated, there was no opportunity to adapt.
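For illustration, here is a minimal sketch of that single-prompt shape, assuming an OpenAI-compatible client in JSON mode; the prompt wording, field names, and model are placeholders rather than the project's actual prompt.

```python
import json
from openai import OpenAI  # any OpenAI-compatible client

client = OpenAI()

SINGLE_SHOT_PROMPT = """Given the user's context about an item, return JSON with:
- "category": your best classification of the item
- "has_enough_context": whether the item can be scored from this context
- "follow_up_questions": every question you would need answered, generated all at once
- "score": a 1-10 score, or null if context is missing
"""

def evaluate_item_single_shot(user_context: str) -> dict:
    """One call does categorization, the context check, question
    generation, and scoring in a single pass."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": SINGLE_SHOT_PROMPT},
            {"role": "user", "content": user_context},
        ],
    )
    return json.loads(response.choices[0].message.content)
```

Everything happens inside one response, which is exactly why the questions can't react to each other's answers.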
The Step-Based Approach
My latest version breaks the workflow into discrete, single-purpose steps:
- Categorizing the item
- Explicitly deciding whether there’s enough context
- Generating follow-up questions through a feedback loop when context is missing
- Scoring the item with a structured rationale
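Sketched roughly, under the same assumptions as before (OpenAI-compatible client, placeholder prompts, and a hypothetical `ask_user` callback for collecting answers), the orchestration looks something like this:

```python
import json
from openai import OpenAI

client = OpenAI()

def call_step(instructions: str, context: str, model: str = "gpt-4o-mini") -> dict:
    """Each step is a small, single-purpose LLM call that returns JSON."""
    response = client.chat.completions.create(
        model=model,
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": instructions},
            {"role": "user", "content": context},
        ],
    )
    return json.loads(response.choices[0].message.content)

def run_workflow(item_context: str, ask_user, max_questions: int = 5) -> dict:
    category = call_step('Categorize the item. Reply as {"category": "..."}', item_context)["category"]

    context = item_context
    for _ in range(max_questions):
        check = call_step('Is there enough context to score this item? Reply as {"enough": true/false}', context)
        if check["enough"]:
            break
        # Each question is generated with every earlier answer in view,
        # so it can adapt instead of being part of a fixed up-front batch.
        question = call_step('Ask the single most useful follow-up question. Reply as {"question": "..."}', context)["question"]
        context += f"\nQ: {question}\nA: {ask_user(question)}"

    return call_step(
        f'Score this {category} item from 1-10. Reply as {{"score": ..., "rationale": "..."}}',
        context,
    )
```

The feedback loop is the key difference: both the context check and the next question see every answer collected so far.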
Splitting the workflow into smaller prompts added orchestration overhead, but changed the system in important ways:
- Each step became independently observable and testable
- Failures could be detected and recovered from in isolation
- Schemas were easier to enforce (both points are sketched after this list)
- Each capability could be versioned and rolled out independently
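To make the schema and isolation points concrete, here is one way a per-step contract could look, using Pydantic for validation; the models, field names, and retry policy are illustrative, not the project's actual code.

```python
from pydantic import BaseModel, ValidationError

# One schema per step keeps each contract small and easy to validate.
class CategoryResult(BaseModel):
    category: str

class ScoreResult(BaseModel):
    score: int
    rationale: str

def run_step_validated(step_fn, schema: type[BaseModel], step_input: str, retries: int = 2):
    """Run one step and validate its raw output against that step's schema.
    A bad response only re-runs this step, never the whole workflow, and the
    attempts and errors are visible per step rather than buried in one big call."""
    last_error = None
    for _ in range(retries + 1):
        raw_output = step_fn(step_input)  # step_fn returns the model's raw JSON text
        try:
            return schema.model_validate_json(raw_output)
        except ValidationError as err:
            last_error = err
    raise last_error
```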
Runtime Model Selection with OpenRouter
To support that kind of per-step independence, I integrated OpenRouter so each step can choose its model at runtime. That also makes model choice observable: outputs can be judged, analyzed, and scored per task.
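Concretely, OpenRouter exposes an OpenAI-compatible API, so routing can be as simple as a per-step model table; the table below and the model slugs in it are illustrative, not the project's actual routing.

```python
import os
from openai import OpenAI

# OpenRouter is OpenAI-compatible, so the standard client works once it is
# pointed at the OpenRouter base URL.
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

# Hypothetical routing table: each step picks its own model at runtime.
STEP_MODELS = {
    "categorize": "openai/gpt-4o-mini",
    "context_check": "openai/gpt-4o-mini",
    "follow_up": "anthropic/claude-3.5-sonnet",
    "score": "anthropic/claude-3.5-sonnet",
}

def run_step(step: str, instructions: str, context: str) -> str:
    response = client.chat.completions.create(
        model=STEP_MODELS[step],  # model chosen per task, at call time
        messages=[
            {"role": "system", "content": instructions},
            {"role": "user", "content": context},
        ],
    )
    return response.choices[0].message.content
```

Because the model name is just data attached to each step, swapping a model for one task doesn't touch any other part of the workflow, and every output can be attributed to the exact model that produced it.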
What stood out wasn’t that one approach was universally better, but that they optimize for different goals. Single prompts optimize for speed and simplicity. Step-based prompts optimize for control, reliability, and evolvability.
What’s Next
The next step is adding an evaluation layer (via Langfuse) to track quality, consistency, and failure modes over time. With per-step routing in place, the system can treat model selection itself as a feedback loop—continuously scoring LLM performance and adjusting which model is used for each task based on what actually performs best.
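As a sketch of where that could land (nothing here is built yet): evaluation scores, for example exported from Langfuse, could feed a small router that keeps a rolling average per step and model and mostly picks the current best performer, with a little exploration left in.

```python
import random
from collections import defaultdict

class ModelRouter:
    """Hypothetical feedback-loop router: record evaluation scores per
    (step, model) pair and usually pick the best rolling performer."""

    def __init__(self, candidates: dict[str, list[str]], explore_rate: float = 0.1):
        self.candidates = candidates     # step -> candidate model slugs
        self.scores = defaultdict(list)  # (step, model) -> recent scores
        self.explore_rate = explore_rate

    def record(self, step: str, model: str, score: float) -> None:
        """Called after an evaluation run scores one step's output (0.0 to 1.0)."""
        self.scores[(step, model)].append(score)

    def pick(self, step: str) -> str:
        """Mostly exploit the best rolling average; occasionally explore."""
        models = self.candidates[step]
        if random.random() < self.explore_rate:
            return random.choice(models)

        def rolling_avg(model: str) -> float:
            history = self.scores[(step, model)][-50:]  # recent window only
            return sum(history) / len(history) if history else 0.5

        return max(models, key=rolling_avg)
```

In that setup, the static `STEP_MODELS` lookup from earlier would be replaced by `router.pick(step)`, and each evaluation run would call `router.record(...)` to close the loop.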