Reasoning About Reasoning - III

Now you don't

1. Here We Go Again

We’re back, and this time we’re talking about reasoning evaluation in scenarios involving black-box language models. A quick refresher: a model is black-box if its internal details (e.g. activations, weights) are hidden from you.

Black-box contexts are the backbone of modern AI-based applications. And down the line, these will be the bedrock that agentic AIs rest on. As a consequence, we’re forced to deal with the matter of reasoning indirectly at the application level.







2. Tennis For Two

We’re going to have to take a radically different approach from the framework discussed in the previous issue. We need a language model to evaluate our language model’s reasoning!

You might recall from the last issue that we strongly discouraged the use of a language model for generating synthetic reasoning data — that’s because we had access to activations and could rely on those. We have no such luck here. We’re forced to get another language model to help us out.

Here’s a diagrammatic overview of how things’ll work out:

(i) Standard data flow; (ii) data flow with the evaluator in the middle.

It’s not all that different in structure from what we discussed in the last issue. The differences are the inability to view activations and the addition of an evaluator agent.

3. Enforcers

The evaluator agent will be responsible for assessing reasoning, and the key assumption we’re making is that this model, too, is a black-box. Unless you happen to be hosting a high-end model yourself, it’d be unwise to have a low-resource local model assess an external one.

Our modus operandi will be feeding carefully constructed system prompts to the evaluator. Here are the essential steps involved:

  • Defining the evaluator’s purpose.

  • Outlining evaluation metrics.

  • Implementing the flow.

  • Assessing the evaluator.

In addition, we can do something nifty and set up a feedback loop. This would let your evaluator repeatedly re-request an answer from the language model until it’s satisfied.

Of course, we can’t do much without having a purpose in mind. Why do we care about reasoning at all?

Let’s suppose we had an LLM write up blog articles for us (spoiler alert: ours aren’t). We’d care not just about the correctness of the article but also about whether it’d click with our readerbase. To that end, we might want a couple of jokes and a tonality in line with our blog’s general vibe.

To explain all of that to an evaluator model, we’ll need to feed it a system prompt. In essence, a system prompt is akin to roleplaying like you would on RuneScape: you write up a long, well-structured prompt that reads like a character backstory and hand it over to your evaluator. The key is to be highly specific about every single facet of the process; you don’t want your evaluator guessing when faced with the unknown.
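To make that concrete, here’s a minimal sketch of a purpose-defining system prompt for the blog scenario above, written as a Python constant so we can reuse it later on. Everything in it (the persona, the 0–10 scale, the output format) is an assumption you’d swap out for your own context.

# A hypothetical purpose-defining system prompt for the evaluator.
EVALUATOR_SYSTEM_PROMPT = """\
You are an editorial reviewer for a technical blog about AI agents and LLMs.
Another language model drafts our articles; your job is to evaluate each draft.
Judge factual correctness, whether the tone matches a casual, lightly humorous
newsletter, and whether a technical reader would stay engaged to the end.
Always reply with a single integer score from 0 to 10 and nothing else,
unless you are explicitly asked to explain your reasoning.
"""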

There’s more to this system prompt than purpose, though. We’ll also need to spell out the evaluation criteria.

4. Judgement

If you read the last issue, you might recall something handy we whipped up — the CCR matrix.

The CCR Reasoning Matrix

To recap, the CCR matrix cares about the following:

  • Completeness — is the answer provided by the model a complete or incomplete response?

  • Concision — is the answer provided by the model efficiently stated or verbose?

  • Relevance — is the answer provided by the model composed entirely of relevant facts and ideas?

This is a deliberately general yet versatile framework for assessing reasoning capabilities, and the nice thing about it is that you can bolt on additional criteria based on the context you’re working in.

Want your evaluator to assess an LLM’s ability to grade physics exams? Might as well add “appeals to the laws of physics” to the matrix. Want it to assess accounting books? Better add adherence to GAAP.

Additional expectations such as constraint adherence and accuracy can be layered on top of the CCR matrix. We’re keeping things simple here mainly because every ICP has different priorities for reasoning evaluation.
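In practice, outlining the metrics just means appending them to that same system prompt. Here’s a hedged sketch that bolts the CCR criteria (plus one domain-specific add-on) onto the EVALUATOR_SYSTEM_PROMPT constant from earlier; the equal weighting is an illustrative choice, not a rule.

# Illustrative CCR rubric appended to the evaluator's system prompt.
CCR_CRITERIA = """\
Score the draft against the following criteria, weighted equally:
1. Completeness - does the draft fully address the request, or does it leave gaps?
2. Concision - is it efficiently stated, or padded and verbose?
3. Relevance - is every fact and idea in it actually relevant to the request?
Domain add-on (accounting posts only): adherence to GAAP.
Fold all of this into the single 0-10 score described above.
"""

EVALUATOR_SYSTEM_PROMPT = EVALUATOR_SYSTEM_PROMPT + "\n" + CCR_CRITERIA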

5. Tools of the Trade

Some fairly common tools for agents and evaluation.

Let’s take a minute or two to discuss some tools you might find useful in defining evaluator agents for your own contexts — here they are:

  • LlamaStack

  • LangSmith

  • Confident-AI

  • CrewAI

You can use these frameworks to create an LLM-based agent with judgement ability — since our concern is reasoning evaluation at runtime, just toss your agent both the input prompt and the generated response. Your judge will do the rest.

Of course, there’s some key traits to the tools themselves. Let’s break it down.

LlamaStack has the main advantage of efficiency and robustness. You’d be hard-pressed to run into a major problem running with LlamaStack — it’s reliable at what it does.

LangSmith supports trajectory evaluation — this means it’ll check for the correctness of the partial steps taken by an LLM agent. In addition, LangSmith has solid support for additional tools and plugins.

Confident-AI is a paid service — you might know them as the folks behind deepeval. In addition to creating judge-like agents, they support additional evaluation schemes such as summarization, answer relevancy, faithfulness, etc.
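As a taste of what these libraries give you, deepeval ships a judge-style metric called GEval that accepts free-form criteria much like our CCR matrix. The sketch below reflects its documented usage as best we recall; the API moves quickly, so double-check the current docs before copying it.

from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

# A CCR-style judge: free-form criteria, scored by an LLM under the hood.
ccr_metric = GEval(
    name="CCR",
    criteria="Assess the completeness, concision, and relevance of the actual output with respect to the input.",
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
)

draft = "..."  # placeholder: the article your primary model produced
test_case = LLMTestCase(input="Write a short post on KV caches.", actual_output=draft)
ccr_metric.measure(test_case)
print(ccr_metric.score, ccr_metric.reason)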

CrewAI is very solid for splitting up a task pipeline into discrete agentic pieces — if you’re trying to translate books using AI, you might have one agent for reading, one for a basic translation, and another for stylistic editing.

6. Go With The Flow

Our next task, once we’ve fed a system prompt with context and expectations to the evaluator, is to actually implement the program flow. This isn’t particularly tricky: simply pass the output of your model over to the evaluator.

Assuming your system prompt was well-designed, your evaluator ought to provide a numerical score for the supplied answer. It will base its judgement on your system prompt, so think of it as a monkey’s paw situation and be very careful with how you word it.

In case your evaluator is messing up, you may need to refine the system prompt or preemptively write “corrective prompts” to squeeze correct answers out of the evaluator. These could be triggered by basic checks, e.g. prompting the evaluator with “That’s good, but please provide me a single number and nothing else” if you expected a single number and it spat out an essay.
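Stitched together, the flow plus the corrective prompt might look something like the sketch below. It assumes an OpenAI-compatible chat API, the EVALUATOR_SYSTEM_PROMPT constant we built earlier, and a stand-in model name; none of those choices are prescriptive.

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in your environment

def score_answer(question: str, answer: str, max_corrections: int = 2) -> float:
    messages = [
        {"role": "system", "content": EVALUATOR_SYSTEM_PROMPT},
        {"role": "user", "content": f"Request:\n{question}\n\nDraft to evaluate:\n{answer}"},
    ]
    for _ in range(max_corrections + 1):
        reply = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
        text = reply.choices[0].message.content.strip()
        try:
            return float(text)  # the happy path: a bare number
        except ValueError:
            # Corrective prompt: keep the evaluator's reply in context and nudge it.
            messages.append({"role": "assistant", "content": text})
            messages.append({"role": "user", "content": "That's good, but please provide me a single number and nothing else."})
    raise RuntimeError("Evaluator never produced a numerical score.")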

Of course, you don’t want to let all this happen unchecked. To that end, you may want to generate logs. Within your system prompt, leave room for the evaluator to “remember” the reasons behind its scoring, then have your program request comments after it provides a score. Store these comments in a location of your choosing, and carry on!
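One way to wire that up, continuing the sketch above; the follow-up wording and the JSONL log file are arbitrary choices, not part of any particular framework.

import json
import time

def request_comments(question: str, answer: str, score: float) -> str:
    # Ask the evaluator to explain itself only after it has committed to a score.
    messages = [
        {"role": "system", "content": EVALUATOR_SYSTEM_PROMPT},
        {"role": "user", "content": f"Request:\n{question}\n\nDraft to evaluate:\n{answer}"},
        {"role": "assistant", "content": str(score)},
        {"role": "user", "content": "Briefly list the reasons behind your score."},
    ]
    reply = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
    return reply.choices[0].message.content

def log_evaluation(question: str, answer: str, score: float, comments: str, path: str = "eval_log.jsonl") -> None:
    # Append one JSON record per evaluation so scoring decisions stay auditable.
    record = {"ts": time.time(), "question": question, "answer": answer, "score": score, "comments": comments}
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")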

7. Around the World

Here’s a neat trick: we can have the evaluator set up a feedback loop with the original model and force higher-quality answers out of it.

Of course, you’ll have to design a system prompt for this scenario too.

What you’ll want is for your system to regenerate an answer whenever the evaluator considers it subpar. If you’ve stored the evaluator’s comments somewhere, you can feed them into subsequent prompts to your primary language model.

If the evaluator spits out something like “The text incorrectly identifies fish as a type of fruit”, you better be sending this to the model!

Now, we do need to set a limit on how many times we regenerate an answer. Since we’re using an external model, there’s a cost attached to every repeat answer, and besides, we never want to risk an endless loop. 3-4 iterations ought to be a safe bet, as in the sketch below.
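Pulling it all together, a capped feedback loop could look like this. It leans on the score_answer, request_comments, and log_evaluation helpers sketched earlier, and both the threshold and the iteration cap are arbitrary picks you’d tune for your own budget.

MAX_ITERATIONS = 4
SCORE_THRESHOLD = 7.0  # assumes the evaluator scores on a 0-10 scale

def generate_with_feedback(question: str) -> str:
    feedback = None
    answer = ""
    for _ in range(MAX_ITERATIONS):
        prompt = question if feedback is None else (
            f"{question}\n\nYour previous draft was rejected by a reviewer for these reasons:\n"
            f"{feedback}\n\nPlease write an improved draft."
        )
        answer = client.chat.completions.create(
            model="gpt-4o-mini",  # the primary model; swap in whatever you actually use
            messages=[{"role": "user", "content": prompt}],
        ).choices[0].message.content
        score = score_answer(question, answer)
        feedback = request_comments(question, answer, score)
        log_evaluation(question, answer, score, feedback)
        if score >= SCORE_THRESHOLD:
            break
    return answer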

8. That’s A Wrap

With that, we’ve pretty much wrapped up our discussion of reasoning evaluation. Observant readers might’ve noticed that this sort of stuff involves a fair bit of guesswork and spit magic — for those of us at the application level of AI, this will be all that we can muster.

We’ve provided a fairly general pipeline that can be adapted to your context — if you’re dealing with an external LLM for your application and want to check its reasoning, might as well try what we’ve cooked up. As they say, “Good artists copy, great artists steal.”

If you are building AI agents, we can possibly partner up. Visit our website or schedule a call.
