Why hamper LLMs with guardrails?


Say what you will about LLM technology, it’s remarkable that we can do computations on the scale of billions of parameters, trained on large chunks of humanity’s collective text and media, at all. It’s equally remarkable that you can talk to “it” in everyday language and get a recognizable response, often (but not always) a pretty good one, and that all of this rests on the simple but powerful “select the best next token” algorithm run in a loop. The concept would have made a terrific sci-fi series, and here we are with it working in our cloud at scale.
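
For concreteness, here is a toy sketch of that loop. Nothing here is a real model: the next_token_distribution function is a made-up stand-in for the billions of trained parameters, and the loop simply picks a token, appends it to the context, and repeats.

```python
import random

def next_token_distribution(context):
    """Stand-in for a real LLM: returns (token, probability) pairs.
    In a real system this distribution comes from the trained network."""
    vocab = ["the", "cat", "sat", "on", "mat", "."]
    weights = [random.random() for _ in vocab]
    total = sum(weights)
    return list(zip(vocab, [w / total for w in weights]))

def generate(prompt, max_tokens=10):
    """The 'select the best next token' idea, run in a loop."""
    context = prompt.split()
    for _ in range(max_tokens):
        dist = next_token_distribution(context)
        # Greedy choice: take the most probable token.
        # Real systems often sample from the distribution instead.
        token = max(dist, key=lambda pair: pair[1])[0]
        context.append(token)
        if token == ".":
            break
    return " ".join(context)

print(generate("the cat"))
```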

From the get-go ChatGPT had guardrails, and I suspect its quality suffered as a result; we never got a chance to compare. I understand that guardrails are nice in theory, but they come with a cost. Specifically, in terms of the threat model, I believe the point is to protect well-mannered users from unknowingly receiving viciously harmful responses; I’m much less worried about guardrails meant to stop someone dead set on getting the LLM to say naughty things. By way of analogy, if an LLM were a power tool, protect users from accidental injury, but not necessarily from willingly abusing the tool to do something patently dangerous. For reasons I cannot understand, the industry seems unified in attempting the latter, and I don’t think it’s a stance they are likely to succeed at.

Our big models apparently siphon up large chunks of the web and then get hammered in the end by RLHF(*) to play nice. This looks like yet another instance of trying to make a system safe and secure after the fact rather than carefully building it that way in the first place. (For decades we’ve known: Garbage In Garbage Out. The curation of training data is well known to be quite opaque.)

(*) Reinforcement Learning from Human Feedback (RLHF) is a technique used in training large language models (LLMs) to align their behavior with human preferences. Human raters judge or rank model outputs, and those judgments guide further fine-tuning of the model.
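
As a purely illustrative sketch of the idea in that footnote, the following toy code fits a reward score to hypothetical pairwise human preferences (a miniature Bradley–Terry model) and then shifts a toy “policy” toward higher-reward responses. All responses, preferences, and constants are invented for illustration; real RLHF trains a neural reward model and updates the LLM itself with a reinforcement learning algorithm such as PPO.

```python
import math

# Toy setup: one fixed prompt with three candidate responses.
responses = ["helpful answer", "rude answer", "evasive answer"]

# Hypothetical human feedback: (preferred, rejected) index pairs.
preferences = [(0, 1), (0, 2), (2, 1)]

# Step 1: fit a scalar reward per response from the pairwise preferences
# (a tiny Bradley-Terry model trained by gradient ascent).
rewards = [0.0, 0.0, 0.0]
lr = 0.1
for _ in range(200):
    for win, lose in preferences:
        # Probability the current rewards assign to the observed preference.
        p_win = 1.0 / (1.0 + math.exp(rewards[lose] - rewards[win]))
        # Nudge rewards so preferred responses score higher.
        rewards[win] += lr * (1.0 - p_win)
        rewards[lose] -= lr * (1.0 - p_win)

# Step 2: shift the "policy" toward high-reward responses.
# (Real RLHF updates the LLM's weights instead of a lookup table.)
temperature = 1.0
weights = [math.exp(r / temperature) for r in rewards]
total = sum(weights)
policy = {resp: w / total for resp, w in zip(responses, weights)}

for resp, prob in policy.items():
    print(f"{prob:.2f}  {resp}")
```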

I cannot understand why the generative AI safety discussion has, from the start, been focused on this single troublesome approach, with no sign of reassessment or exploration of alternatives. Just off the cuff, many other alternatives can be imagined:

  • Limit training data to trustworthy content through curation.
  • Use reliable reputational metadata and sentiment analysis to filter out bad inputs (a rough sketch of this idea follows the list).
  • Require cited references to be vetted, by reputation, as legitimate sources.
  • Fact-check training data as well as LLM outputs.
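
As that second bullet suggests, a filtering pass could sit in front of training. The toy filter below admits a document only if its source clears a reputation threshold and its text clears a toxicity check. The corpus, the SOURCE_REPUTATION table, the toxicity_score function, and the thresholds are all hypothetical placeholders; a real pipeline would use curated source metadata and a proper sentiment or toxicity classifier.

```python
# Hypothetical corpus: each document carries its source and raw text.
corpus = [
    {"source": "encyclopedia.example", "text": "The moon orbits the Earth."},
    {"source": "rantblog.example", "text": "Everyone who disagrees is an idiot."},
    {"source": "newswire.example", "text": "The committee published its findings."},
]

# Placeholder reputation scores per source (0 = untrusted, 1 = trusted).
# In practice this metadata would come from curation, not a hard-coded dict.
SOURCE_REPUTATION = {
    "encyclopedia.example": 0.95,
    "newswire.example": 0.85,
    "rantblog.example": 0.20,
}

def toxicity_score(text: str) -> float:
    """Stand-in for a real sentiment/toxicity classifier.
    Here: a crude keyword check, purely for illustration."""
    bad_words = {"idiot", "hate", "kill"}
    words = {w.strip(".,!?").lower() for w in text.split()}
    return 1.0 if words & bad_words else 0.0

def keep(doc, min_reputation=0.5, max_toxicity=0.5) -> bool:
    """Admit a document only if its source is reputable and its text is benign."""
    reputation = SOURCE_REPUTATION.get(doc["source"], 0.0)
    return reputation >= min_reputation and toxicity_score(doc["text"]) <= max_toxicity

filtered = [doc for doc in corpus if keep(doc)]
for doc in filtered:
    print(doc["source"], "->", doc["text"])
```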

A representative post mentions red-teaming as an important mitigation of generative AI harm, but this raises a fundamental question: why try to prevent users from getting generative AI to do harmful things? Clearly we want to protect well-behaved users who comply with usage policy, but why must we stop anyone intent on producing harmful results? Word processors do not attempt to block anyone from writing incendiary diatribes, compilers do not prevent anyone from writing malware, and so on; generative AI alone takes on this extraordinarily difficult task, for reasons I cannot imagine.

To be clear, I am not suggesting that AI should have zero safety precautions. What I question is why all the big platforms have chosen the same questionable means of providing them, without seeming to recognize its considerable downsides or to explore the many alternatives available. I say questionable because of the well-known Waluigi effect, whereby trying to suppress bad behavior has the unwelcome effect of promoting it.

One big downside of the current approach is that the system is hypersensitive to certain subjects and excessively self-censors, which is a significant disservice. For example, I tried to have a discussion about the methodology of US political opinion polls, but failed to get any responses because the topic triggered a block on politics. The whole point of a political poll is to conduct it fairly and without bias (for it to mean anything), so refusing to engage on that topic makes seeing through all the disinformation and misinformation that much harder.

I had an interesting chat with Gemini about this. If interested, you can skip ahead to the “Recap” section to see where it ends up.

The LLM nicely makes the de facto standard case for guardrails, and the discussion was fun because, since it isn’t a person, I took the liberty of being very direct and not pulling any punches (while remaining civil, but just barely). Some good questions came up, such as: LLMs can write poetry, but isn’t that only because they have been trained on human poems?