Building for AI robustness and accuracy
This post explores the limitations of LLMs, the problem of AI alignment, and the quest for robust and safe AI. From hallucinations to jailbreaks, learn how the research community is responding.
The media has been abuzz with AI's promise over the last two years. But amidst the fanfare, a crucial voice of skepticism often goes unheard: the research community. With corporate propaganda rampant, it's harder than ever to separate fact from fiction.
In this series of posts, I'll shed light on the exciting work of researchers who are "poking around" under the hood of LLMs and multimodal AI systems. Their findings reveal the real challenges of building safe and reliable AI.
Why is the problem of AI Alignment complex?
The alignment problem is making an Artificially Intelligent system do what its creators and consumers intend it to do.
With AI in the recent past, many of the significant breakthroughs have come through transformers and large language models. We have so much data now, so why is the problem of steering AI so hard if we have data & sophisticated modeling techniques? It might have to do with the data itself or the lack thereof.
Most commercial IT systems are designed to be deterministic. Code and systems are built to ensure predictable, consistent outputs.
I’ll quote my advisor’s example to illustrate this problem. This one uses just a simple data table.
Determinism dilemma: Suppose we have a preference graph of people & their hobbies listed (Fig 1). The answer to “Does Jeff have any interest in classic rock?" will always return False in every database or lookup table. This is a property of the system to make it deterministic and consistent. If you work with data in any capacity, you’re familiar with some version of this limitation. At the core, LLMs might try to generalize from incomplete data and change their output based on imperfectly collated human preferences (RLHF).
The state of the art in collecting preferences and tuning models is RLHF, but it has significant flaws. Capturing accurate human preferences is an open area of research and a recurring topic in Reinforcement learning. Some foundational challenges of preference modeling include differences between individual and group preferences, social pressures, cognitive biases, bounds on rationality, and evolving preferences of humans.
Some researchers go as far as to say it’s futile for us (humans) to specify our intentions; instead, machines should implicitly figure out our preferences.
Robustness
Based on my experience working with LLMs, the most prominent risk in achieving safety is deterministic, consistent, and accurate responses. Hallucinations appear to be unwanted latent capabilities in LLMs that we want to eliminate ideally. However, no magic prompt or feature can control this emergent phenomenon.
Jailbreaking an LLM is also trivial. You can subvert months of safety training with just a few spurious characters or malicious prompts, sometimes simply misspelling a word or two. There are entire websites collecting patterns of prompts and spurious inputs for you to try out, and the current state of the art is trying to match these patterns and guard against them. It’s a time-tested solution from the world of machine learning. Most people who Productionize ML systems know that every general-purpose ML system is fronted by another ML system that guards the inputs.
How do you build for robustness?
In reliability engineering, the concept of the five nines is used: Can the system be available and responsive 99.999% of the time?
Similarly, Can you expect an LLM to give you consistent & accurate responses 99.99% of the time? It’s an aspirational goal and one that is worth pursuing for special-purpose AI systems, especially in the domains of healthcare, finance, etc, where the cost of getting something wrong could be disastrous.
I’m unsure of the state of the art regarding consistency and accuracy or whether such a measurement is even possible. I’ll focus on some exciting techniques for robustness that seem promising.
The most obvious one is testing for robustness. This involves building red teaming datasets, test beds, and more to check for bias detection and malice, and mitigating those risks. There’s tons of work on this from the ML world, and recently, there has been much more focus on LLMs.
Deep forgetting and Unlearning to create more tightly scoped models
This is a reasonably intuitive direction that involves training the model to forget. The core idea is that LLMs should only know what they are supposed to. We know pre-training on internet data can introduce falsehoods, foul language, and other unwanted latent capabilities. This paper focuses on how unlearning can help steer a model in a narrow domain.
Detecting out-of-distribution generalization: This research direction is also an open problem focusing on detecting out-of-distribution data and generalizations from LLMs about them. The researchers propose a learning theory termed “probably approximately correct” and how it can be leveraged to detect and hopefully course-correct inferences & learning on OOD data.
Improving safety despite intentional subversion: I found this paper interesting, especially its “trusted editing” approach to checking if a model is compromised and whose output cannot be trusted. I like this approach because it starts off not trusting the LLM. As we build powerful models with interdisciplinary capabilities, it will be hard for humans to test models for subterfuge. Automated approaches like these are scalable and could potentially help solve for robustness.
If you’re still here and curious to learn more, watch this lecture on Adversarial robustness with examples and methods.
Credits: Thanks to the great folks at Bluedot and their Alignment course which exposed me to these ideas.