Setting the Stage
Recently, I set out to explore a simple but important question:
How well can local language models reason step by step when given structured guidance?
This wasn’t about exhaustive benchmarking. I wasn’t chasing leaderboards or performance scores.
Instead, I wanted something closer to how we reason day-to-day: deliberate, small steps taken with care.
To explore this, I tested three local LLMs across two different tasks using ReAct-style prompting:
- Gemma3 1B – tiny, fast, and efficient
- LLaMA3.1 8B – compact, capable, thoughtful
- Deepseek-R1 14B – large, deliberate, canvas-style thinker
This post offers a glimpse into how ReAct prompting can shape LLM reasoning patterns in everyday scenarios.
Why Use ReAct Prompting for LLM Reasoning?
ReAct-style prompting teaches models to move through a specific loop:
Thought → Action → Observation → Thought, gathering insights steadily rather than rushing toward an answer.
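To make that loop concrete, here's a minimal sketch of the kind of prompt scaffold I'm describing. The wording and the placeholder question are illustrative, not the exact prompt from my runs:

```python
# Minimal ReAct-style prompt scaffold (illustrative wording, not an exact prompt).
# {question} is filled in per task before the text is sent to the model.
REACT_TEMPLATE = """You are solving a problem step by step.
Work in this exact loop, one step per line:

Thought: what you currently know and what to figure out next.
Action: the single next thing to check or try.
Observation: what that action revealed.

Repeat Thought / Action / Observation as needed, then end with:
Final Answer: your conclusion.

Question: {question}
"""

prompt = REACT_TEMPLATE.format(question="Why does the man stop at the 7th floor?")
print(prompt)
```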
Originally developed for agent systems and web-based tool use, ReAct has since influenced agent frameworks and follow-on approaches such as LangChain, BabyAGI, and ReWOO.
Each of these builds on the principle that structured prompting improves how models perceive and interact with the world.
In this field test, I wasn’t just evaluating answers.
I wanted to see whether the step-by-step ReAct reasoning style could improve how local models work through tasks.
Testing Local LLMs with ReAct-Style Prompting
Task 1: The Logic Riddle ("The Man in the Elevator")
Each model faced a classic logic puzzle:
A man lives on the 10th floor of an apartment building.
Every morning, he takes the elevator down to the ground floor to go to work.
Every evening, he rides only to the 7th floor and walks the rest of the way home.
He hates walking. Why stop at the 7th floor?
Results:
- Gemma3 1B struggled to maintain context. It overlooked important clues and reached disjointed conclusions.
- LLaMA3.1 8B created plausible but imagined stories, hinting at creativity but drifting from grounded reasoning.
- Deepseek-R1 14B came closest to intuitive human reasoning. It used its own canvas logic structure and arrived at an acceptable answer.
This small riddle task underscored that abstract reasoning remains a challenge for smaller local LLMs, even with structured guidance.
Task 2: Simulated Tool Use (Price Checker)
Next, I simulated a tool-use scenario:
Each model was tasked with checking a list of item prices, one by one, and choosing the cheapest.
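Before the results, here is a minimal sketch of how a loop like this can be wired up. The item names, prices, and the `query_model` placeholder are all illustrative; this is the shape of the harness, not my exact code:

```python
import re

# Illustrative price table; the simulated "tool" simply looks items up here.
PRICES = {"notebook": 3.50, "pen": 1.20, "stapler": 6.75}

def check_price(item: str) -> str:
    """The simulated tool: return one item's price as an Observation string."""
    price = PRICES.get(item)
    return f"{item} costs {price:.2f}" if price is not None else f"unknown item: {item}"

def query_model(prompt: str) -> str:
    """Placeholder: swap in a call to whatever local model runtime you use."""
    raise NotImplementedError

def run_react_loop(question: str, max_steps: int = 8) -> str:
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        reply = query_model(transcript)            # model emits Thought/Action lines
        transcript += reply + "\n"
        match = re.search(r"Action:\s*check_price\[(.+?)\]", reply)
        if match:                                  # run the tool, append an Observation
            transcript += f"Observation: {check_price(match.group(1).strip())}\n"
        elif "Final Answer:" in reply:             # the model decided it is done
            break
    return transcript
```

The useful property of this setup is that the model never writes its own Observation lines: every price it sees comes from the harness, which makes it easy to tell whether the loop is actually being followed.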
Results:
- Gemma3 1B performed surprisingly well. The concrete, step-driven nature of the task fit its strengths. It followed the ReAct loop cleanly without hallucinating context.
- LLaMA3.1 8B and Deepseek-R1 14B both handled the structured prompting well, with Deepseek adapting the loop slightly into its internal scratchpad style—a useful variation when working with canvas-style LLMs.
The ReAct framework provided a consistent scaffold that smaller local models navigated with surprising reliability.
What I Learned About Step-by-Step Reasoning with Local LLMs
This wasn’t a formal benchmark.
It was a practical glimpse into how prompting structure impacts local LLM reasoning.
- Concrete tasks favor smaller models when guided carefully through ReAct loops.
- Abstract tasks expose the limits of compact models, even with strong prompting.
- Larger models adapt better to structured reasoning styles, though they can still drift without tight framing.
Most importantly:
Good structure won’t guarantee perfect results.
But poor structure almost guarantees poor results.
How we prompt, and how we break problems into small, navigable steps, matters as much as the model's weights or raw capability.
Understanding how a model reasons, not just what it outputs, is a critical part of deploying LLMs effectively.
Final Thoughts: Shaping Model Behavior Through Structured Prompting
This experiment was just a small test, but it reinforced something important:
If we want local LLMs to behave more reliably, prompt structure is non-negotiable.
Clear step-by-step reasoning frameworks like ReAct provide scaffolding that even strong models benefit from.
And when working with small or mid-sized local LLMs, that scaffolding can make the difference between scattered guesses and genuine, useful insights.
I’ll be sharing more field tests soon—including variations on ReAct, Chain-of-Thought, and other structured prompting styles.
If you've tried ReAct prompting with local models, I’d love to hear about your experiences.