ReAct Prompting with Local LLMs: A Field Test in Step-by-Step Reasoning

A model that learns to walk carefully will eventually run. First thought, then step, then stride.

Published April 28, 2025 · Tags: PromptEngineering, LLMs, ReAct, FieldTest

Setting the Stage

I wanted to answer a simple question: Can small and mid-sized local models reason in careful steps when you force them to slow down?

I wasn’t trying to crown a winner. No leaderboard, no benchmark charts. I wanted something closer to what happens in real Support work: you hand an AI assistant a messy task and you need it not to bluff.

So I ran the same small set of tasks against three local models I had available at the time:

  • Gemma3 1B — tiny, fast, efficient
  • LLaMA3.1 8B — compact, capable, sometimes imaginative
  • Deepseek-R1 14B — larger, slower, more self-aware in how it “thinks”

I used a ReAct-style prompting loop for each test. The loop looks like this:

Thought → Action → Observation → Thought → … → Answer

The idea is that the model shouldn’t just spit out an answer. It should take an action (even a simulated one), look at what happened, and then update its reasoning. In other words: behave less like a guesser and more like an analyst.
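The loop above can be sketched as a small driver. This is a minimal, hypothetical sketch: `fake_model` and the `lookup` tool stand in for a real local-model call and a real tool, so the control flow runs on its own.

```python
# Minimal sketch of the Thought → Action → Observation → Answer loop.
# fake_model is a stub standing in for any local model call (Ollama,
# llama.cpp, etc.); a real version would send the transcript to the model.

def fake_model(transcript: str) -> str:
    if "Observation:" in transcript:          # it has seen a tool result
        return "Answer: Paris"
    return "Action: lookup(capital of France)"

def run_react(task: str, tools: dict, max_steps: int = 5) -> str:
    transcript = f"Task: {task}\nThought: let me work step by step."
    for _ in range(max_steps):
        step = fake_model(transcript)
        transcript += "\n" + step
        if step.startswith("Answer:"):
            return step.removeprefix("Answer:").strip()
        if step.startswith("Action:"):
            # Run the named tool and feed the result back as an Observation.
            name, arg = step.removeprefix("Action:").strip().rstrip(")").split("(")
            transcript += f"\nObservation: {tools[name](arg)}"
    return "no answer within step budget"

tools = {"lookup": lambda q: "Paris" if "France" in q else "unknown"}
print(run_react("What is the capital of France?", tools))
```

The useful property is the stopping rule: the model only gets to emit an Answer after the transcript contains at least one Observation, which is exactly the "look before you conclude" discipline the prompt is meant to enforce.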

Why ReAct?

ReAct-style prompting started in agent work: give the model a job, let it think out loud, let it “do” something (like check a tool or read a file), then let it continue thinking based on what it saw.

That pattern shows up now in a lot of agent frameworks and orchestration libraries. But I wasn’t building a production agent here. I was asking something more basic:

If we apply that same structure to a normal reasoning task, does the model stay grounded better?

Said another way: can we use prompting structure to lower hallucination without adding any external tools at all?

Test 1: The Logic Puzzle

First test was an old riddle:

A man lives on the 10th floor of an apartment building.
Every morning, he takes the elevator down to the ground floor to go to work.
Every evening, he rides the elevator back up — but only to the 7th floor — then walks the rest of the way home.
He hates walking. Why does he stop at the 7th?

What happened:

  • Gemma3 1B lost the thread. It would start down one line of reasoning, then pivot, then forget what it already said. It was guessing answers instead of building a consistent story.
  • LLaMA3.1 8B sounded confident but made things up. It invented background details that weren’t in the puzzle and treated them as facts. Classic “sounds right, isn’t right.”
  • Deepseek-R1 14B did the best. It walked through clues, explained assumptions, and landed on a basically correct human-style answer (he’s too short to reach the elevator buttons above the 7th floor without help).

Takeaway from this task: abstract reasoning is still rough for smaller local models, even with a structured loop. You can slow them down, but you can’t force intuition that isn’t really there.

Test 2: Simulated Tool Use (Price Check)

Next test was closer to operations work. I gave each model a “tool use” style task: compare prices across a short list of items and pick the cheapest. For this run, the “tool use” was just simulated (no real API calls), but the structure was:

Thought: I should check each price.
Action: Look at Item A’s price.
Observation: Item A costs X.
…repeat until decision.
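Because the tool use was simulated, the "correct" trace can be scripted ahead of time and used as a reference to grade each model's transcript against. A minimal sketch, with made-up item names and prices:

```python
# Sketch of the simulated price check: no real API call, the "tool" is
# just a dict lookup. Items and prices here are invented for illustration.

prices = {"Item A": 4.99, "Item B": 2.49, "Item C": 3.75}

def build_reference_trace(prices: dict) -> str:
    lines = ["Thought: I should check each price."]
    for item, price in prices.items():
        lines.append(f"Action: Look at {item}'s price.")
        lines.append(f"Observation: {item} costs ${price:.2f}.")
    cheapest = min(prices, key=prices.get)
    lines.append(f"Answer: {cheapest} is the cheapest at ${prices[cheapest]:.2f}.")
    return "\n".join(lines)

print(build_reference_trace(prices))
```

A model that "held up well" here is one whose transcript matches this shape: one Action/Observation pair per item, and an Answer that only uses prices that actually appeared in an Observation line.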

What happened:

  • Gemma3 1B actually held up well here. The task was concrete and step-driven, so it stayed polite, linear, and accurate. No drift, no hallucinated numbers.
  • LLaMA3.1 8B performed fine. It followed the loop and made a defensible choice.
  • Deepseek-R1 14B did something interesting: it kept the loop but rewrote it in its own “scratchpad” style (almost like a canvas). It behaved like, “Here’s my plan, here’s what I saw, here’s what that means.” That’s actually useful in an audit trail.

Takeaway from this task: when the problem is concrete, even a small model can act disciplined if you hand it rails.

So What Did We Learn?

This wasn’t a product bake-off. It was a sanity check on one idea: Does structure change model behavior?

  • Concrete, checkable tasks favor smaller models — especially if you force them to move in steps and not jump to the “final answer” voice.
  • Abstract reasoning still exposes the limits of small models. They either wander or they invent.
  • Bigger models aren’t just “smarter.” They’re better at staying consistent across steps, and they’re better at explaining what they’re doing. That last part matters for audit, not just accuracy.

The headline for me was this:

Good structure doesn’t guarantee a good answer.
Poor structure almost guarantees a bad one.

That applies directly to Support teams using AI today. If you drop raw customer context into ChatGPT and say “help,” you’re going to get something that sounds nice and may be wrong in a very expensive way.

But if you force a loop — “summarize the issue in plain language,” “list possible root causes,” “ask what data we’re missing,” “propose next action we can verify” — you’ve basically given that model rails. Now you can check each rail.
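That Support loop can be sketched as staged prompts run in sequence, each stage's output kept for review before it feeds the next. This is a hypothetical outline, not a production workflow; `ask` is a stub where a real model call would go.

```python
# Sketch of "rails" for a Support workflow: one fixed instruction per
# stage, each stage's output inspectable before the next stage runs.

STAGES = [
    "Summarize the issue in plain language.",
    "List possible root causes.",
    "Ask what data we're missing.",
    "Propose a next action we can verify.",
]

def ask(instruction: str, context: str) -> str:
    # Stub: a real version would send instruction + context to the model.
    return f"[model output for: {instruction}]"

def run_rails(ticket: str) -> list[tuple[str, str]]:
    context, trail = ticket, []
    for stage in STAGES:
        out = ask(stage, context)
        trail.append((stage, out))   # each rail is a checkable artifact
        context += "\n" + out        # later stages see earlier outputs
    return trail
```

The point of the structure is the trail: instead of one opaque answer, you get four small outputs, each of which a human can reject before it contaminates the next step.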

Final Thoughts

ReAct-style prompting isn’t magic. It’s scaffolding. It gives the model permission to slow down and gives you a surface to inspect.

And with local models — the kind you can run yourself, the kind you can keep off the public internet — that scaffolding can be the difference between “neat demo” and “this actually helps my team.”

I’ll keep testing variations on this: ReAct, Chain-of-Thought with guardrails, forced comparison prompts, etc. Because the real problem isn’t “Can AI talk?” The real problem is “Can we trust what it’s saying enough to act on it with a customer watching?”

Give your team rails, not risk

We train Support teams to ask better questions, verify AI output, and cut escalations without cutting QA.
