Concrete Compliance
Getting LLMs to actually do what you meant

1. Monkey’s Paw
It’s safe to say that LLMs are incredibly useful in designing sharper solutions to business problems.
Want to summarize a document? Throw it at an LLM. Want a ballpark guess at project requirements? LLM.
If you’ve been following my LinkedIn (https://www.linkedin.com/in/soban-raza/ for those of you wondering), then you’d have noticed a prompt the boys at Antematter cooked up for win/loss analysis on sales calls.

Our prompt for win/loss analysis on sales call transcripts.
And this stuff is just the tip of the iceberg! With all the stuff going down in ML-space, especially reproducing reasoning capabilities in ‘weak’ models, it seems our generation has found itself in ‘interesting times’.
But there’s a catch.
No matter how well thought out your system is or how painstakingly engineered your prompt is, there’s a good chance your LLM will spit something out that breaks your requirements.
I genuinely can’t tell whether to laugh or cry when after 10 hours of perfecting my prompt, I still get something like “A good title for the next blog post is Insert Blog Title Here” despite clearly stating — no, BEGGING — for just the title and only the title.
I’ve asked the engineers about it, and even they’re annoyed.
Heck, as a litmus test, just ask any engineer working on an AI-based solution about getting an LLM to spit out consistent JSON. If you don’t get a thousand-yard stare, I genuinely want to poach your engineers.
But anyhow, enough of the rambling — the aim today is to try to fight specific cases of LLMs breaking the rules with the help of some tools and strategies. Hold on to your seats, we’re on bumpy turf!
2. Purple Prose
Let’s get the easy stuff out of the way first — how do you get an LLM to maintain stylistic or tonal consistency?
At present, there are really only two systematic strategies to go for — prompt engineering and feedback loops.

How Prompt Engineering works — the idea is to refine one prompt over several LLM conversations.
So, what’s prompt engineering?
Think of your LLM as a five-year-old genie — not only does it have trouble figuring out what you mean, but it’ll also look for loopholes stemming from ambiguity.
As a consequence, you’re often forced to write in a very elaborate manner that leaves little-to-no room for alternative interpretations.
Let’s consider my original use case of potential blog titles — if I throw “Suggest me a title for a technical blog aimed at a business audience” at something like GPT, I’ll get a response like
“Okay, here’s some potential titles for a technical blog.
1. Tech Meets Business
2. Maximizing ROI
…”
What’s the problem? Well,
1. If I’m trying to automate my workflow and that requires putting titles in somewhere, I want just the title and nothing else.
2. The ideas generated are too surface-level, so I’m forced to provide more details.
Let’s suppose we instead throw “Suggest me a title, and ONLY the title, of a technical blog aimed at a business audience interested in automating complex workflows. The title suggested should address a sharp, specific pain-point which can potentially use AI. The title must address a specific business process.” at it.
We now get "Revolutionizing Supply Chain Management: AI-Driven Automation for Seamless Operations". Still not quite there, but it’s a lot better.
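If you want to wire an engineered prompt like this into an automated workflow, here’s a minimal sketch. It assumes the OpenAI Python SDK, an API key in your environment, and a placeholder model name — swap in whatever stack you actually use:

```python
# pip install openai
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# The engineered prompt: explicit about format, scope, and audience,
# leaving as little room for "creative" interpretation as possible.
ENGINEERED_PROMPT = (
    "Suggest me a title, and ONLY the title, of a technical blog aimed at "
    "a business audience interested in automating complex workflows. "
    "The title suggested should address a sharp, specific pain-point which "
    "can potentially use AI. The title must address a specific business process."
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # hypothetical choice; use whichever model you prefer
    messages=[{"role": "user", "content": ENGINEERED_PROMPT}],
)

title = response.choices[0].message.content.strip()
print(title)
```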

How Feedback Loops work — the idea is to probe the LLM further based on what it has already said.
Feedback loops are fairly similar to engineered prompts — the only difference is that when you get a response from the LLM, you send it a prompt that specifies what to do differently.
Let’s go back to the first prompt we cooked up. Instead of the more elaborate prompt, we could respond with “Give me only one title, and make sure it addresses a specific business concern”.
Now we get “Reducing Operational Bottlenecks: How AI Enhances Invoice Processing Efficiency”, which is closer to what we want.
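In code, a feedback loop is just a multi-turn conversation where the follow-up prompt carries the correction. A minimal sketch, again assuming the OpenAI Python SDK and a placeholder model name:

```python
from openai import OpenAI

client = OpenAI()

def ask(messages):
    """Send the running conversation and return the assistant's reply."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # hypothetical choice
        messages=messages,
    )
    return response.choices[0].message.content

# Turn 1: the naive prompt.
messages = [{
    "role": "user",
    "content": "Suggest me a title for a technical blog aimed at a business audience",
}]
first_try = ask(messages)

# Turn 2: feed the response back with a correction, instead of starting over.
messages += [
    {"role": "assistant", "content": first_try},
    {"role": "user", "content": "Give me only one title, and make sure it addresses a specific business concern"},
]
second_try = ask(messages)
print(second_try)
```

The key design choice is that the correction rides on top of the existing conversation, so the model refines what it already said rather than generating from scratch.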
Despite what I’ve just said, there’s still the odd chance the LLM gets a mind of its own and violates my orders. Every time I shoot a prompt, it’s really a toss of a heavily biased coin.
Unfortunately though, AI-written text is a mixed bag. It might cook up a good idea or two (provided it’s not hallucinating), but the actual writing lacks flair. So if you do want to automate this stuff, you better be on top of things.
3. Lock n’ Load
Now let’s tackle the trickier stuff — how can you get your LLM to spit out structured output without messing up?
That’s where a bunch of helpful tools come on to the scene!

A compilation of frameworks that enable valid JSON extraction from LLMs.
In a nutshell, we have
- BAML, which defines schemas in its own DSL (transpiled to Pydantic models) and uses a Rust-based, error-tolerant parser.
- Guidance, which supports enums, regex, Pydantic, and JSON schemas for output definition, and lets you constrain even self-hosted models using token healing.
- Instructor, which uses Pydantic to define output types and supports LLM-based retries.
- JSONFormer, which relies exclusively on JSON schemas and constrains self-hosted models by generating only the content tokens.
- LMQL, which uses its own constraint language to get the right outputs, enforcing constraints on self-hosted models via token masking.
- Marvin, which is powered by Pydantic under the hood and supports LLM-based retries.
- Mirascope, which is built around Pydantic and supports retry calls powered by Tenacity.
- Outlines, which is spicy in that, in addition to the usual Pydantic and JSON schemas, it also lets you use EBNF grammars to tackle outputs, using structured generation to constrain self-hosted models.
- TypeChat, which uses TypeScript type definitions (rather than Pydantic) to specify schemas and supports automatic LLM-based retries.
The tools above let you take a crack at extracting correct JSON output from an LLM of your choice — just make sure your model is supported.
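To make that concrete, here’s a minimal sketch using Instructor from the list above. It assumes the OpenAI Python SDK and a placeholder model name; the schema itself is a made-up example:

```python
# pip install instructor openai pydantic
import instructor
from openai import OpenAI
from pydantic import BaseModel, Field

# The structure we want the LLM to comply with, defined as a Pydantic model.
class BlogTitle(BaseModel):
    title: str = Field(description="A single blog title addressing a specific business process")

# Patch the client so responses are parsed and validated against the model.
client = instructor.from_openai(OpenAI())

result = client.chat.completions.create(
    model="gpt-4o-mini",  # hypothetical choice
    response_model=BlogTitle,
    max_retries=2,  # on validation failure, re-ask the LLM with the error attached
    messages=[{
        "role": "user",
        "content": "Suggest a title for a technical blog aimed at a business "
                   "audience interested in automating complex workflows.",
    }],
)

print(result.title)  # a validated BlogTitle object, not free-form text
```

Instead of praying the raw text parses, you get a validated object back, and failed validations are fed back to the model automatically.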
Why’s JSON important, exactly? Well, in case you’ve been living under a rock, JSON has thoroughly infiltrated the web services space and become the de facto standard for representing data. You can’t walk two meters without it being mentioned!
And since the modern AI-scape is built atop what the web folks have been up to, those are just the cards we’ve been dealt. But eh, you learn to deal with it.
4. Au Revoir
With that, we’ve pretty much covered what it takes to get LLMs to do what you actually want them to do — be it coaxing out the right kind of text, or just flat-out using a library to force a specific kind of output.
If you’re particularly cheeky, you might want to incorporate spicier elements into your prompting strategy like the folks over at Windsurf. But that’s a topic for another day.
Hopefully, improved models will reduce the hassle, but those of us not working on foundation models can’t bet on it.
If you’re interested in AI agents for your organization or products, visit our website at antematter.io.