The LLM doesn’t have to innately implement filtering. You can use a more traditional and concrete filtering strategy on top. So you sneak something problematic by in the prompt and it’s too clever to be caught by the input filter, but then on the output the filter can catch that the prompt tricked the LLM into generating something undesired. Another comment specified they tried this and it started to work but then suddenly it seemingly shut out the reply in the middle, presumably the minute the LLM spit something at a more traditional filter and that shut it down.
I think I’ve seen this sort of approach has been applied to largely mask embarassing answers that become memes, or to detect input known not to work, and to shut it down or redirect it to a better facility (e.g. redirecting math to wolfram alpha).
The LLM doesn’t have to innately implement filtering. You can use a more traditional and concrete filtering strategy on top. So you sneak something problematic by in the prompt and it’s too clever to be caught by the input filter, but then on the output the filter can catch that the prompt tricked the LLM into generating something undesired. Another comment specified they tried this and it started to work but then suddenly it seemingly shut out the reply in the middle, presumably the minute the LLM spit something at a more traditional filter and that shut it down.
I think I’ve seen this sort of approach has been applied to largely mask embarassing answers that become memes, or to detect input known not to work, and to shut it down or redirect it to a better facility (e.g. redirecting math to wolfram alpha).