This post is the first in a series touching on a number of GenAI-related anecdotes, musings and lessons learnt. Let us start with some product basics… how do you build a solution heavily reliant on LLMs in mid-2024?
It’s quick and easy to create impressive GenAI-driven proof-of-concept products. In just days, not months, you can develop something that will wow your board or less tech-savvy teammates. You can be a product and innovation rockstar!
But bringing these supposedly quick and simple products up to production quality is often a long, frustrating process, and potentially dangerous to your brand. Ask some of the biggest players in tech how their recent launches have gone…
We were all blown away by the amazing automation promised for our workspaces and the wealth of knowledge about to be dumped on us; soon we would all be sitting back and letting the AI deal with our problems. That is what the highly produced product pitches of 2023 showed us… A year later, the products are either still featureless, stuck in beta, or have gone live to public embarrassment. If the people building the models and investing billions can’t get it right, you know it’s harder than it looks at first try.
My teams have been doing “boring” old AI/ML for years and have had LLM workflows in production since the GPT-3.5 days. But as we near the launch of some of our most ambitious, very obviously AI-driven products and platforms, I thought it would be useful to jot down a few high-level pointers for those embarking on a similar journey.
A successful end needs a plan from the start.
The first problem of going from demo to production is understanding LLMs’ strengths and weaknesses. I may do a much longer opinion piece later on where they excel and fail in real-world applications, but the first thing to consider is… keep their context simple, clear and measurable.
A common question around where to use AI in your organization is often phrased as “what would you do with a thousand interns?” This is exactly the mindset to take when designing an LLM-powered solution in today’s environment of what OpenAI refers to as Level 1 Assistants.
While it’s possible to create a single prompt that handles multi-step, complex tasks effectively in a GPT or with carefully curated inputs, building and maintaining a complicated set of instructions at production scale and security will be infinitely harder in a single prompt than dividing and conquering with smaller, discrete, manageable workloads. We all fall into the trap of starting simple and then adding more and more context debt to our prompts, just as we did with feature and technical debt in our products.
Even if an LLM can handle multiple actions in a single prompt during basic tests, things can quickly fall apart when you try to tweak functionality, expand the requirements, allow public access with varied inputs, or protect against the system going off script (more on that later).
Going back to the intern metaphor… would you hand your day-one intern (no matter how well they could recite the full contents of Wikipedia) a high-level job description and tell them to go represent your brand in front of your clients? Of course not… instead they get the reproducible, measurable and testable tasks first, and even then, multiple interns check each other’s work, all with a senior keeping careful oversight over their collective shoulders. (In this metaphor the senior is often a human in the loop.)
But it all looked so easy!
And when I say break it down, I mean really break it down, way more than you expect to be necessary. For example, if we wanted to summarize some text and then decorate it with emojis, this should be done in two separate steps… “but surely not, I can write you a prompt right now to do both, it will be quicker and cheaper!”
Actually, it won’t be either, at any scale you care about. The prompt combining both of those tasks will probably be as long as two prompts tackling them separately, and presuming you are only passing the summary through, the extra tokens are negligible. You can also leverage cheaper, faster models more reliably on narrow-scope tasks, and have better control over the exact output once you bring in fine-tuning.
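To make that concrete, here is a minimal sketch of the two-step version, assuming the OpenAI Python SDK; the model name, prompts and input text are purely illustrative:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def summarize(text: str) -> str:
    # Step 1: a narrow, measurable task with its own small prompt.
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative; pick whatever cheap/fast model suits the task
        messages=[
            {"role": "system", "content": "Summarize the user's text in three sentences."},
            {"role": "user", "content": text},
        ],
    )
    return response.choices[0].message.content

def add_emojis(summary: str) -> str:
    # Step 2: only the summary is passed through, so the extra tokens are negligible.
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Decorate the text with relevant emojis without changing the wording."},
            {"role": "user", "content": summary},
        ],
    )
    return response.choices[0].message.content

article = "…your long input text here…"
final_output = add_emojis(summarize(article))
```

Each step can now be tested, benchmarked and swapped to a different (or fine-tuned) model independently.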
Fine-tuning is easier to do than many expect, and it often costs less in total despite a higher per-token charge. That is because a fine-tuned model needs less prompting: fewer repeated instructions to drive home a point the AI ignores, and less finicky adjusting of the output.
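As a rough illustration of how little ceremony is involved, this is the general shape of a fine-tuning job against OpenAI’s API; a minimal sketch in which the file name, the examples and the base model are all made up for illustration:

```python
import json
from openai import OpenAI

client = OpenAI()

# Each training example is a short chat transcript showing exactly the output you want.
examples = [
    {"messages": [
        {"role": "system", "content": "Decorate the text with relevant emojis."},
        {"role": "user", "content": "Quarterly sales were up 12% on strong demand."},
        {"role": "assistant", "content": "Quarterly sales were up 12% 📈 on strong demand 💪"},
    ]},
    # ...a few dozen more examples covering your tone and edge cases
]

with open("emoji_tune.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")

# Upload the training file and start the fine-tuning job.
training_file = client.files.create(file=open("emoji_tune.jsonl", "rb"), purpose="fine-tune")
job = client.fine_tuning.jobs.create(training_file=training_file.id, model="gpt-3.5-turbo")
```

Once the job finishes, the resulting model can usually be called with a one-line system prompt, which is where the per-request token savings come from.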
When using multiple, simple actions, the only area where you do lose out is “emergent features”, i.e. the output’s ability to achieve something unexpected. Unfortunately, most of the time in the corporate world… unexpected is bad… because unpredictable is bad… and inconsistent is bad… making many solutions untestable and untrustable… A lot of what makes GenAI exciting and interesting is also what makes it very dangerous.
But what could really go wrong?
No doubt you have read all about hallucinations and how to control them. Wherever possible, ground all your answers in your own knowledge and data, and provide tools to do any calculations or processing. Never let the model attempt logic and math itself, especially if you need to guarantee the results are correct. I always try to use the AI more as a summarisation engine than a fount of knowledge. LLMs are, after all, just a lossy compression of the internet, and we all know the internet is never wrong…
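For the calculations point, function calling is the usual pattern: the model decides it needs a tool, and plain code does the arithmetic. A minimal sketch, again assuming the OpenAI Python SDK, with a hypothetical calculate_total tool:

```python
import json
from openai import OpenAI

client = OpenAI()

# A hypothetical tool: our code, not the model, is responsible for the arithmetic.
tools = [{
    "type": "function",
    "function": {
        "name": "calculate_total",
        "description": "Add up a list of line-item prices and return the exact total.",
        "parameters": {
            "type": "object",
            "properties": {"prices": {"type": "array", "items": {"type": "number"}}},
            "required": ["prices"],
        },
    },
}]

messages = [{"role": "user", "content": "What do these items cost together: 19.99, 4.50 and 2.25?"}]
response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative
    messages=messages,
    tools=tools,
    tool_choice={"type": "function", "function": {"name": "calculate_total"}},  # force the tool call
)

# The model only extracts the numbers; deterministic Python produces the answer.
call = response.choices[0].message.tool_calls[0]
prices = json.loads(call.function.arguments)["prices"]
total = round(sum(prices), 2)
print(f"Total: {total}")
```

Feed the total back as a tool message if you want the model to phrase the final reply, but the number itself never comes from the LLM.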
But more concerning than a hallucination is what happens if the AI ever tries to be helpful 😱 After all, many of today’s models are trained to always try to answer the question, even to a fault. Hallucinations often occur because the training data on a specific topic is insufficient to produce a clear right answer.
However, sometimes it’s even more embarrassing when the model gets the answer right!… but you really don’t want it to. A perfect example we encountered was when creating an AI chat-based search experience for a marketplace: if the user asked where to get a certain product, the model would very helpfully recommend the competition, along with many reasons why they were better!
In this case the model is just doing what was asked of it; it knew the correct answer and was simply trying to be helpful. However, this poses a massive problem when designing AI solutions.
My code is doing things on its own!
Over the years, one fundamental truth has always held in software development: the program only does what is in the code. Sometimes users find novel ways to abuse the code, just see any speed run at AGDQ. But ultimately users cannot do anything the code does not allow, whether or not the original developer intended it.
But with LLMs… well now you have a very friendly, very knowledgeable, very helpful intern manning the customer support desk. They believe the customer is always right and will bend over backwards to help them in any way possible! And most of those ways are things you never intended, and worryingly never expected.
Sure, you can give the initial prompt a persona and try to reinforce principles like not recommending the competition or bad-mouthing your users. Forget trying to protect it from malicious users prompt-injecting you; it’s hard enough just to stop the model from going wildly off script on its own. These are, after all, general-purpose models, trained to do anything.
And getting back to my initial point, this is where you must divide and conquer. The more instructions and control you try to place in the prompt the more confused the output and the more likely it is to go AWOL, especially as the context window of the conversation grows.
So what can we do about it?
The best solutions for now are to:
- Break problems down into small, measurable actions rather than driving everything from a single prompt, and optimize each of those actions independently.
- Leverage RAG and function calling to control the content, playing to the AI’s strengths in data extraction (generating structured input) and answer summarisation (minimizing the risk of hallucination).
- Try to keep the core workflow clean of checks and extraneous instructions; instead, gatekeep business rules elsewhere in the workflow, using techniques like vector-space answer checking or guardian models that validate output rules but are not also responsible for output generation (see the first sketch after this list).
- Minimize chat history length and ever-growing contexts in most of your actions. Yes, modern models have huge context windows, but they still very easily start ignoring instructions as the context grows. This is especially true if you are also returning a lot of RAG or search-result data for the model to work through before responding.
- Keep the message history in the main chat assistant (presuming you need a chat), but treat every other action as its own fresh context each time, passing in only the data needed. This ensures more predictability and testability, because you can control and replay small inputs over and over without fear of historical context polluting the current task.
- Get anyone and everyone to test, and encourage them not to stick to the critical flows. Ask them what else they would like to do, then let them try; almost immediately they will find those great “emergent features” you had never thought of.
- When you absolutely need an LLM to do more complex reasoning in a single large prompt, employ techniques like chain of thought and hide the thinking from the user. Claude’s system prompt demonstrates this well with <antthinking> tags, where the model labels its workings inside the tags. This means it can still “think” through the problem, but only the answer is shown to the user in the UI (see the second sketch after this list).
- And lastly… just roll with it. This is amazing technology, but also one which will work in mysterious and unpredictable ways. This can scare off executives and can pose a real threat to your brand if not handled correctly. Trying to over-check and control the outputs will only make it worse (images of 1940s German soldiers, anyone?).
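To make the guardian-model bullet concrete, here is a minimal sketch: a second, narrow prompt that only reviews output against the business rules and never writes the answer itself. The rules, model name and fallback wording are all illustrative:

```python
from openai import OpenAI

client = OpenAI()

GUARDIAN_RULES = (
    "You are a reviewer, not a writer. Reply with exactly PASS or FAIL.\n"
    "FAIL if the text recommends a competitor, bad-mouths the user, "
    "or makes claims not supported by the provided source data."
)

def guardian_check(candidate_answer: str, source_data: str) -> bool:
    # A separate, small prompt validates the output; it is never responsible for generating it.
    verdict = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative; a cheap model is usually enough for a yes/no check
        messages=[
            {"role": "system", "content": GUARDIAN_RULES},
            {"role": "user", "content": f"Source data:\n{source_data}\n\nAnswer to review:\n{candidate_answer}"},
        ],
    ).choices[0].message.content.strip().upper()
    return verdict.startswith("PASS")

# Usage: only ship the generated answer if the guardian passes it; otherwise fall back.
# if not guardian_check(answer, retrieved_docs):
#     answer = "I can't help with that one, let me find a human who can."
```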
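And for the hidden chain-of-thought bullet, the mechanics are as simple as telling the model to wrap its reasoning in a tag and stripping that tag before anything reaches the UI. A small sketch (the tag name follows the Claude example above; the sample output is made up):

```python
import re

# What the model returned after being instructed to reason inside <antthinking> tags.
raw_output = (
    "<antthinking>The user asked for Q3 totals; summing the three regions gives 4,200.</antthinking>"
    "Your Q3 total comes to 4,200 units."
)

# Drop everything inside the thinking tags so only the answer is shown to the user.
visible_answer = re.sub(r"<antthinking>.*?</antthinking>", "", raw_output, flags=re.DOTALL).strip()
print(visible_answer)  # -> "Your Q3 total comes to 4,200 units."
```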
Rather than fighting the technology, put your efforts into understanding where you can use GenAI safely in your existing products. Explore how to leverage its strengths while not exposing yourself to its weaknesses. But most importantly, go beyond playing and concepting and actually try to build something production-ready.
You will be thrilled, scared, elated and bitterly disappointed in the technology all at once. What a time to be building products!
Next time, let’s discuss how AI stands to revolutionize UI design.
