Introduction
Large Language Models (LLMs) like GPT-4o excel at generating human-like text, but their ability to follow structured reasoning remains an open question. To better understand these limits, I ran an experiment using Wordle—a simple five-letter word puzzle that requires logical deduction and constraint tracking. Wordle is particularly useful for testing AI because success depends on applying feedback about letter positions across multiple guesses, something humans do naturally but LLMs may struggle with due to their pattern-based nature.
The goal was to see whether GPT-4o could consistently solve Wordle when guided by the ReAct framework (Reasoning + Acting). ReAct is designed to encourage step-by-step thinking by breaking down the AI’s responses into a loop of reasoning, action, and observation. Would this structure help the model maintain consistency, or would it still default to probabilistic guessing?

Experiment Setup
I gave GPT-4o a ReAct-style scaffold to simulate a Wordle-solving process (a rough code sketch of the loop follows the list):
- Reasoning steps: The AI needed to state its logic before each guess (e.g., what letters it believed were valid and why).
- Actions: It would propose a word guess.
- Observations: Based on feedback (green, yellow, gray), the AI would adjust its reasoning for the next guess.
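To make the scaffold concrete, here is a minimal sketch of that reasoning-action-observation loop in Python. The helper name call_model stands in for whatever chat-completion wrapper you use, and the prompt wording is illustrative rather than the exact scaffold text from my runs.

```python
# Minimal sketch of a ReAct-style Wordle loop (illustrative, not the exact setup used).
# `call_model` is a hypothetical wrapper around a chat-completion API.

def score_guess(guess: str, secret: str) -> str:
    """Return Wordle-style feedback: G = green, Y = yellow, - = gray."""
    feedback = ["-"] * 5
    remaining = list(secret)
    # First pass: greens consume their letter so duplicates are scored correctly.
    for i, (g, s) in enumerate(zip(guess, secret)):
        if g == s:
            feedback[i] = "G"
            remaining.remove(g)
    # Second pass: yellows for letters still unaccounted for elsewhere.
    for i, g in enumerate(guess):
        if feedback[i] == "-" and g in remaining:
            feedback[i] = "Y"
            remaining.remove(g)
    return "".join(feedback)

def react_wordle(secret: str, call_model, max_turns: int = 6) -> bool:
    transcript = [
        "You are solving Wordle. Each turn, write a Reasoning step explaining "
        "which letters and positions are ruled in or out, then a line of the "
        "form 'Action: GUESS' with a five-letter word."
    ]
    for _ in range(max_turns):
        reply = call_model("\n".join(transcript))          # Reasoning + Action
        guess = reply.split("Action:")[-1].strip().upper()[:5]
        feedback = score_guess(guess, secret)              # e.g. "--Y-Y"
        transcript += [reply, f"Observation: {guess} -> {feedback}"]
        if feedback == "GGGGG":
            return True
    return False
```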
Two rounds of testing were performed:
- Basic ReAct Scaffold: A minimal step-by-step reasoning format.
- Enhanced Constraint Evaluation: A stricter version where the model had to list and evaluate known constraints before making a guess.
Experiment 1 – Basic ReAct
In the first round, GPT-4o performed surprisingly well in one test, solving the puzzle “PLUCK” by reasoning through letter positions without contradicting its earlier deductions. The model followed its observations closely and adjusted its guesses logically.
However, in about half of the other tests, its reasoning and guesses became misaligned. For example, it correctly summarized the feedback it had received, then immediately violated its own stated constraints:
First Guess: "CRANE"
Feedback: A and E were yellow
AI reasoning for second guess: “The letters ‘A’ and ‘E’ are in the word but in different positions. I'll try another word using these letters in different spots and avoid ‘C’, ‘R’, ‘N’.”
AI guess: “LEAVE.”
Problem: Both A and E stayed in the same positions it had just ruled out.
This inconsistency suggested that simply saying its reasoning aloud wasn’t enough for the AI to enforce it.
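One way to see how thin that enforcement is: the same rules take only a few lines to check programmatically. The snippet below is an illustrative sketch I'm including for clarity, not something the model had access to during the experiment; it encodes the yellow feedback from “CRANE” as banned positions and rejects “LEAVE” for reusing them.

```python
# Illustrative only, not part of the experiment: turn the feedback the model
# said it understood into hard constraints and test its next guess against them.

def yellow_violations(guess: str, yellows: dict[str, set[int]]) -> list[str]:
    """List the ways a guess breaks 'letter is present but not here' rules."""
    problems = []
    for letter, banned_positions in yellows.items():
        if letter not in guess:
            problems.append(f"{letter} must appear somewhere in the word")
        for pos in banned_positions:
            if guess[pos] == letter:
                problems.append(f"{letter} repeats banned position {pos + 1}")
    return problems

# Feedback from "CRANE": A was yellow in slot 3, E was yellow in slot 5.
yellows = {"A": {2}, "E": {4}}                 # zero-based positions
print(yellow_violations("LEAVE", yellows))
# ['A repeats banned position 3', 'E repeats banned position 5']
```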
Experiment 2 – Enhanced Constraint Evaluation
To address these issues, I added a second layer: a “reasoning loop” where GPT-4o had to explicitly enumerate known constraints and evaluate each potential guess against them before committing. I also clarified how I gave feedback on guesses to remove potential ambiguity.
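Roughly, the extra layer amounted to regenerating an explicit constraint list before every guess and asking the model to check its candidate against each item. The sketch below shows one way such a prompt could be assembled; the function name, argument structure, and wording are illustrative assumptions, not the exact prompt used in the runs.

```python
# Paraphrased sketch of the constraint-evaluation layer; names and wording
# are illustrative assumptions, not the exact prompt from the experiment.

def build_constraint_prompt(greens: dict[int, str],
                            yellows: dict[str, set[int]],
                            grays: set[str]) -> str:
    lines = ["Before guessing, list every known constraint and verify your candidate against each one:"]
    for pos, letter in sorted(greens.items()):
        lines.append(f"- Position {pos + 1} must be '{letter}'.")
    for letter, banned in sorted(yellows.items()):
        slots = ", ".join(str(p + 1) for p in sorted(banned))
        lines.append(f"- '{letter}' is in the word but NOT in position(s) {slots}.")
    if grays:
        lines.append(f"- These letters are absent: {', '.join(sorted(grays))}.")
    lines.append("For your candidate, state pass/fail on every rule; only output the guess if all rules pass.")
    return "\n".join(lines)

# After the "CRANE" feedback from Experiment 1:
print(build_constraint_prompt(greens={},
                              yellows={"A": {2}, "E": {4}},
                              grays={"C", "R", "N"}))
```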
While the AI gave detailed self-checks, the contradictions persisted. In one exchange, it confidently stated:
AI reasoning: “E must not be in the fourth position… I’ll ensure my guess avoids that.”
AI guess: “DUVET” (with E in the fourth position).
These errors weren’t about forgetting instructions—they stemmed from how LLMs generate text. GPT-4o doesn’t internally store a structured map of rules. Instead, it generates outputs by predicting the next token based on statistical likelihood, which can lead to these lapses even in well-scaffolded prompts.
Insights About LLM Behavior
These experiments highlight a fundamental reality: LLMs are pattern recognizers, not logical reasoners. They excel when tasks align with language patterns they’ve seen before but can falter when success depends on explicitly tracking rules or states over multiple steps. Wordle exposes this limitation because each guess depends on accurately integrating prior feedback, something humans do instinctively but which requires deliberate prompting or external tools for AI.
Understanding this difference is vital for prompt engineering and AI deployment. This is one of the reasons I teach that AI’s “thinking” is probabilistic and pattern-based, which is why carefully designed prompts—and sometimes external logic systems—are necessary for reliable outcomes. Without that understanding, users may misinterpret AI outputs as reasoned conclusions when they’re really just statistically likely sequences of text.
Reflection and Conclusion
These experiments were never about proving that AI can’t play Wordle; they were about understanding where it stumbles and why. GPT-4o’s struggles here mirror the inconsistencies that can appear in real workflows—whether drafting documents, summarizing data, or generating technical outputs. The takeaway is not to dismiss AI but to recognize its strengths and weaknesses.
AI output still needs human oversight. Even the most advanced models can produce errors or illogical outputs. Humans remain essential for spotting gaps, refining prompts, and ensuring that AI-driven processes are trustworthy. Far from making human work obsolete, this creates opportunities for new kinds of expertise—people who know how to guide AI effectively.
Understanding that AI doesn’t “think” like humans helps us see why these difficulties arise and, more importantly, how to overcome them. By treating AI as a powerful but pattern-bound collaborator, we can design better scaffolding and prompts that make it more reliable, whether in a Wordle puzzle or a real-world business process.