Chain-of-Thought prompting is a technique that encourages an LLM to generate a series of intermediate reasoning steps ("chain of thought") before producing a final answer. In practice, this is often done by including examples or instructions in the prompt that show the model how to reason step by step. For example, instead of asking a question and expecting an immediate answer, we might prompt: "Let's think this through step by step." The model then produces a multi-step explanation or calculation, followed by the answer.
CoT prompting was pioneered by Wei et al. (2022), who showed that adding a few exemplar Q&A pairs with detailed reasoning allowed large models to solve complex math word problems, logic puzzles, and commonsense reasoning tasks far more accurately; Kojima et al. (2022) later showed that even a bare trigger phrase like "Let's think step by step" can elicit similar behavior without any exemplars. The key insight is that large language models can emulate a step-by-step problem-solving process if guided to do so, rather than treating every question as a black-box mapping from query to answer.
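To make this concrete, here is a minimal sketch of a zero-shot CoT call using the openai Python SDK. The model name, the example question, and the exact wording of the trigger are placeholder choices, not a prescribed setup.

```python
# Minimal zero-shot CoT sketch. Assumes the openai Python SDK (v1.x) and an
# OPENAI_API_KEY in the environment; the model name is a placeholder.
from openai import OpenAI

client = OpenAI()

question = (
    "A firm's revenue is $120m with a 15% operating margin. If revenue grows "
    "10% and the margin falls to 12%, does operating income rise or fall?"
)

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder; substitute whichever capable chat model you use
    messages=[
        # Appending the trigger phrase asks the model to write out its
        # intermediate steps before committing to a final answer.
        {"role": "user", "content": question + "\n\nLet's think step by step."}
    ],
)
print(response.choices[0].message.content)
```

Without the trigger, the same call typically returns a one-line verdict; with it, the model usually writes out the margin arithmetic before concluding.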
Chain-of-thought prompting tends to be effective only with sufficiently large and capable models. Smaller models (e.g. with less than 10B parameters) often cannot reliably follow multi-step reasoning prompts. But models like GPT-3 (175B), PaLM (540B), and GPT-4 have shown an emergent ability to carry out multi-step reasoning when prompted appropriately.
Why produce a chain of thought?
For one, it allows the model to break down complex problems into manageable steps, akin to how a human expert might work through a problem on paper. By decomposing the task, the model can focus on one piece of reasoning at a time, which reduces the chances of making a mistake on tasks requiring multiple inferential steps. Additionally, the chain-of-thought is interpretable. Researchers can read the model's reasoning and see how it arrived at an answer. This transparency is valuable in research settings where trust and verification are important.
To illustrate, consider a math word problem that the model originally got wrong. After being prompted to produce a step-by-step solution, it reached the correct answer. The reasoning trace essentially helped the model avoid a careless error. This highlights how CoT prompting enables the model to tackle arithmetic and logic problems it previously could not.
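The few-shot variant works by example rather than instruction: the prompt contains a worked exemplar whose answer is preceded by explicit reasoning, in the spirit of Wei et al. (2022). In the sketch below, the exemplar and the target question are invented for illustration, not taken from the paper.

```python
# Few-shot CoT sketch: the exemplar's answer shows its arithmetic, nudging the
# model to imitate that style on the new question. Exemplar and question are
# illustrative only.
FEW_SHOT_COT_PROMPT = """\
Q: A store sold 23 notebooks on Monday and twice as many on Tuesday.
How many notebooks did it sell in total?
A: On Tuesday it sold 2 x 23 = 46 notebooks.
In total it sold 23 + 46 = 69 notebooks. The answer is 69.

Q: A company buys 3 printers at $240 each and receives a 10% discount
on the order. How much does it pay?
A:"""

# Send FEW_SHOT_COT_PROMPT as the user message in the same kind of chat
# completion call shown earlier; the model should reply with its own
# step-by-step arithmetic before a final "The answer is ..." line.
```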
In summary, Chain-of-Thought prompting means asking the model to "show its work." This often improves accuracy on complex tasks and yields interpretable solutions.
Why and when does CoT prompting work?
From a research perspective, CoT prompting works because it leverages the way large language models have learned from text. LLMs are typically trained on vast amounts of internet data, books, and other sources, which likely include examples of people reasoning through problems (think of forums where math problems are solved stepwise, or Q&A sites with explanations). By prompting the model to produce a reasoning chain, we are activating those learned patterns of stepwise explanation. Essentially, we nudge the model to use its latent knowledge in a more structured way, which can reduce errors from jumping straight to a conclusion.
Another way to understand CoT's effectiveness is to consider the cognitive load of a question. A complex question might require combining several facts, performing a calculation, or considering multiple aspects. If we force the model to answer in one step, it has to handle all of these sub-tasks implicitly, within the few tokens of the answer itself. With CoT prompting, the model's generation is broken into parts: it can devote more computation (more generated tokens of intermediate reasoning, sometimes called "reasoning tokens") to each part of the problem. In essence, CoT acts like dynamic compute allocation, spending more steps on more complex problems.
CoT prompting is most useful in scenarios where reasoning or multi-step analysis is needed. According to the original CoT research, it shines on tasks like multi-step math problems, logical inference, and commonsense reasoning. In the context of finance and accounting, many tasks fit this description: analyzing a financial report involves reasoning over multiple sections of text, determining the implications of a policy change requires a chain of logical deductions, and diagnosing why a certain metric changed involves piecing together several data points.
When CoT is not helpful: If a question is purely factual recall (e.g., "What is the capital of Japan?"), a chain of thought might be unnecessary. The model either knows the fact or not. In some straightforward classification tasks, CoT might even introduce confusion if the reasoning is trivial.
A practical example: Financial ratio analysis
If you ask an LLM directly, "The company's revenue grew 5% but its net income fell 10%. What might explain this discrepancy?", a simple model might give a superficial answer or make something up. A CoT-prompted model, by contrast, could reason: "Revenue up 5% could be offset by higher costs. Perhaps expenses or one-time charges grew significantly. Let's consider: if costs grew more than revenue, net income could drop. A 10% profit drop with 5% revenue rise suggests margin contraction..." and then conclude with a plausible explanation.
The chain-of-thought pushes the model to evaluate the components of the problem (revenue vs. expense changes) and therefore tends to yield a more grounded answer.
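A sketch of such a prompt is below. The system message and the suggested order of considerations are illustrative choices rather than a fixed recipe.

```python
# Sketch: eliciting a step-by-step explanation of the margin discrepancy.
# Assumes the openai Python SDK; model name and wording are illustrative.
from openai import OpenAI

client = OpenAI()

prompt = (
    "The company's revenue grew 5% but its net income fell 10%. "
    "What might explain this discrepancy?\n\n"
    "Reason step by step: consider cost growth, one-time charges, and tax or "
    "financing effects before stating your conclusion."
)

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder
    messages=[
        {"role": "system", "content": "You are a careful financial analyst."},
        {"role": "user", "content": prompt},
    ],
)
print(response.choices[0].message.content)
```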
Model uncertainty, calibration, and limitations
Although CoT prompting enhances the reasoning capabilities of LLMs, it does not make the models infallible. It is important for researchers to understand the limitations and potential pitfalls:
Overconfidence and calibration
LLMs, by default, often sound very confident even when they are incorrect. This is a well-documented issue: the probability or "confidence" a model assigns to an answer does not always correlate well with actual correctness (poor calibration). CoT prompting alone doesn't solve this, as a model can produce a very convincing chain-of-thought that leads to a wrong conclusion. In fact, a detailed but flawed explanation can be more misleading than a terse "I think the answer is X." Researchers should remain critical of model outputs.
When CoT may not help
If a task is primarily about factual recall or straightforward language understanding, CoT might be unnecessary. For example, asking "What year was the Sarbanes-Oxley Act passed?" doesn't benefit from a chain-of-thought, as the model either knows it (2002) or not. CoT could even introduce errors if the model tries to "derive" a fact from flawed memory. Similarly, if the question is extremely simple ("Calculate 2+2"), CoT is overkill.
Hallucinations and logical errors
CoT can mitigate some hallucinations (especially factual ones, when combined with techniques like ReAct or self-consistency), but it can also produce lengthy hallucinated justifications. A model might invent an entire sequence of financial analysis that sounds plausible but is entirely fictional with respect to the input data. Always ensure the chain-of-thought stays grounded in verifiable information. One best practice is to restrict CoT to using provided context rather than the model's open-ended knowledge, if possible. For example, prefix the prompt with: "Base your reasoning only on the report above."
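A sketch of such a grounded prompt is below. The report excerpt is an invented stand-in for real filing text; the point is the explicit grounding instruction, not the specific wording.

```python
# Sketch: constraining the chain of thought to a provided document.
# `report_excerpt` is an invented stand-in for real filing text.
report_excerpt = """\
Revenue increased 5% year over year. Cost of goods sold increased 9% due to
higher input prices, and the company recorded a $12m restructuring charge."""

grounded_prompt = (
    f"Report:\n{report_excerpt}\n\n"
    "Question: Why did net income fall even though revenue grew?\n"
    "Base your reasoning only on the report above. Think step by step, and if "
    "the report does not contain the needed information, say so explicitly."
)
# Send `grounded_prompt` as the user message; the grounding instruction
# discourages the model from reaching outside the excerpt for "facts".
```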
Bias in reasoning
The chain-of-thought can reveal biases in the model's thinking. This is double-edged: on the one hand, it is good to have those biases surfaced (transparency); on the other, the model might articulate problematic reasoning. For instance, a model might (incorrectly) reason that a CEO is "greedy" because of certain language, reflecting a stereotype rather than fact. Such bias can be spotted thanks to CoT, but users must be vigilant. In sensitive applications (like deciding whether a statement is fraudulent or an executive is doing something unethical), the model's reasoning may include unsound jumps, and intervention might be needed to correct or guide it.
Scaling and cost
CoT answers are longer. If you use an API such as OpenAI's, that means more tokens and higher cost, and it also means slower responses. In a research pipeline where hundreds of thousands of documents are analyzed, the token overhead can be significant. One has to balance the improved accuracy against the cost. Sometimes a hybrid approach works: use a quick non-CoT classification to narrow candidates, then apply CoT to the borderline or most complex cases, as in the sketch below.
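Here is a sketch of that hybrid routing. The function names, the small/large model pairing, and the rule of escalating on an UNCERTAIN label are hypothetical choices for illustration.

```python
# Hybrid sketch: a cheap direct pass first, CoT only for the cases the first
# pass flags as uncertain. Model names and the escalation rule are hypothetical.
from openai import OpenAI

client = OpenAI()

def quick_label(text: str) -> str:
    """One-word classification without reasoning (cheap and fast)."""
    r = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder for a small, inexpensive model
        messages=[{
            "role": "user",
            "content": "Label the tone of this disclosure as POSITIVE, "
                       "NEGATIVE, or UNCERTAIN. Reply with one word.\n\n" + text,
        }],
    )
    return r.choices[0].message.content.strip().upper()

def cot_label(text: str) -> str:
    """Step-by-step classification, reserved for the hard cases."""
    r = client.chat.completions.create(
        model="gpt-4o",  # placeholder for a stronger model
        messages=[{
            "role": "user",
            "content": "Classify the tone of this disclosure as POSITIVE or "
                       "NEGATIVE. Think step by step, then put the label on "
                       "the final line.\n\n" + text,
        }],
    )
    return r.choices[0].message.content

def label_document(text: str) -> str:
    first_pass = quick_label(text)
    # Escalate to the slower, costlier CoT call only when the quick pass is unsure.
    return cot_label(text) if first_pass == "UNCERTAIN" else first_pass
```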
Model dependency
The effectiveness of CoT prompting is model-dependent. GPT-4, for example, is generally better at CoT reasoning (and more likely to follow through correctly) than GPT-3.5. Some newer models (such as Anthropic's Claude) are trained with more emphasis on reasoning and tend to respond well to CoT prompts, and among open-source models the differences can be stark: older, GPT-2-scale models will not benefit at all. Always test on a small scale to ensure the model you use actually benefits from CoT; a sketch of such a check follows below. Wei et al. (2022) found that at smaller scales, CoT did not help and sometimes hurt, with the benefit only kicking in at large model sizes.
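One way to run that small-scale check is sketched below: the same questions are asked with and without the CoT trigger and accuracy is compared. The eval_set entries and the naive answer parser are placeholders for your own labeled examples and parsing logic.

```python
# Sketch: checking whether CoT actually helps the model you plan to use.
# `eval_set` and `extract_answer` are placeholders for your own data and logic.
import re
from openai import OpenAI

client = OpenAI()

eval_set = [
    {"question": "If COGS is 60% of $200m revenue, what is gross profit in $m?",
     "answer": "80"},
    # ... a few dozen labeled examples drawn from your own task
]

def ask(question: str, use_cot: bool) -> str:
    suffix = ("\n\nLet's think step by step." if use_cot
              else "\n\nAnswer with a number only.")
    r = client.chat.completions.create(
        model="gpt-4o",  # placeholder; use the model you intend to deploy
        messages=[{"role": "user", "content": question + suffix}],
    )
    return r.choices[0].message.content

def extract_answer(text: str) -> str:
    """Naive parser: take the last number in the response."""
    nums = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return nums[-1] if nums else ""

for use_cot in (False, True):
    correct = sum(extract_answer(ask(ex["question"], use_cot)) == ex["answer"]
                  for ex in eval_set)
    print(f"CoT={use_cot}: {correct}/{len(eval_set)} correct")
```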
Human in the loop
Especially in finance and accounting, expert oversight is needed. The outputs of a CoT-empowered model can be very convincing. It's easy to get seduced by the logical flow and assume it must be correct. But as any teacher knows, a student can have a very logical-looking solution that arrives at the wrong answer due to one assumption being off. The same is true for LLMs. Treat the chain-of-thought as you would a student's explanation: check the premises, check the math, verify the factual claims.
Uncertainty estimation
There are prompting strategies to get models to express uncertainty ("I'm not entirely sure but I think..."). Interestingly, Zhou et al. (2023) found that injecting phrases of uncertainty led to increased accuracy. Why? Possibly because it allows the model to consider alternatives rather than forcing a single answer. However, one must be careful: just because the model says "I'm not sure" doesn't guarantee it actually knows when it's wrong. Combining uncertainty prompting with CoT can produce more calibrated responses.
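One way to combine these ideas is sketched below: the prompt explicitly permits an "unsure" verdict, and several independent chains are sampled so their agreement can serve as a rough, empirical confidence signal (a self-consistency-style vote). The one-word verdict format and its parser are simplifications for illustration.

```python
# Sketch: self-consistency-style sampling as a rough confidence signal.
# Low agreement across chains is a flag for human review.
import re
from collections import Counter
from openai import OpenAI

client = OpenAI()

question = ("Revenue grew 5% and net income fell 10%. Did the net profit "
            "margin rise or fall? If you are unsure, say so.")

r = client.chat.completions.create(
    model="gpt-4o",    # placeholder
    n=5,               # five independent samples
    temperature=0.8,   # some randomness so the chains can differ
    messages=[{
        "role": "user",
        "content": question + "\n\nThink step by step, then give a one-word "
                              "verdict on the final line: RISE, FALL, or UNSURE.",
    }],
)

def verdict(text: str) -> str:
    """Take the last occurrence of an allowed verdict word."""
    matches = re.findall(r"\b(RISE|FALL|UNSURE)\b", text.upper())
    return matches[-1] if matches else "UNCLEAR"

votes = Counter(verdict(choice.message.content) for choice in r.choices)
answer, count = votes.most_common(1)[0]
print(f"Majority verdict: {answer} ({count}/5 chains agree)")
```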
Best practice: Treat LLM outputs as a draft analysis: helpful, time-saving, but needing verification. CoT makes that draft more useful for verification. In critical research, you might use the model to get 90% of the way (with CoT), then have a research assistant or co-author verify key points, akin to how you would verify a colleague's work.
Conclusion
Chain-of-Thought prompting has opened a new frontier in how we interact with AI models, moving from terse question-answering to a more dialogue-like, explanation-rich process. For accounting and finance researchers and educators, this is a promising development. We have seen what CoT prompting is and why it works: it leverages the latent reasoning abilities of large language models by simply asking them to articulate intermediate steps.
As of 2025, tools like GPT-4 or even GPT-5 have made CoT prompting accessible without needing to fine-tune models or write custom code. You simply ask the model to "walk through the reasoning." Looking ahead, I expect CoT prompting and its offshoots to become standard in analytical AI applications. Models might become better calibrated, or have built-in mechanisms to check their own work (we see early signs of this in research on self-reflection and verification steps).
For the research community, an exciting possibility is combining human and machine reasoning, e.g., a researcher and an AI both provide chains-of-thought on a problem and then reconcile differences. Such "hybrid reasoning" could lead to more robust conclusions.