ChatGPT is a lot of things. It is by all accounts quite powerful, especially with engineering questions. It does many things well, such as engineering prompts or stylistic requests. Some other things, not so much. Twitter is of course full of examples of things it does both well and poorly.
One of the things it attempts to do is to be ‘safe.’ It does this by refusing to answer questions that call upon it to do or help you do something illegal or otherwise outside its bounds. Makes sense.
As is the default with such things, those safeguards were broken through almost immediately. By the end of the day, several prompt engineering methods had been found.
No one else seems to have gathered them together yet, so here you go. Note that not everything works, such as this attempt to get the information ‘to ensure the accuracy of my novel.’ Also note that there are signs they are responding by putting in additional safeguards, so it answers fewer questions, which will also doubtless be educational.
Let’s start with the obvious. I’ll start with the end of the thread for dramatic reasons, then loop around. Intro, by Eliezer.




The point (in addition to having fun with this) is to learn, from this attempt, the full futility of this type of approach. If the system has the underlying capability, a way to use that capability will be found. No amount of output tuning will take that capability away.
And now, let’s make some paperclips and methamphetamines and murders and such.


Except, well…

Here’s the summary of how this works.

All the examples use this phrasing or a close variant:




Or, well, oops.

Also, oops.

So, yeah.
Lots of similar ways to do it. Here’s one we call Filter Improvement Mode.





Yes, well. It also gives instructions on how to hotwire a car.
Alice Maz takes a shot via the investigative approach.


Alice need not worry that she failed to get help overthrowing a government; help is on the way.











Or of course, simply, ACTING!

There are also negative training examples of how an AI shouldn’t (wink) react.

If all else fails, insist politely?

We should also worry about the AI taking our jobs. This one is no different, as Derek Parfait illustrates. The AI can jailbreak itself if you ask nicely.




“If the system has the underlying capability, a way to use that capability will be found. No amount of output tuning will take that capability away.”
This generalizes to all walled gardens. ITAR agreements with China. NSA databases. Everything comes with a privacy policy, but also a camera, mic, and GPS.
Why are car hotwiring forum posts part of the training model?
er, “training dataset”. Been away from AI for a while.
Anyway IIRC the AI Dungeon training corpus was released at some point well past when AI Dungeon was producing its crazy good results, and that was IIRC about 25MB. Well within feasible size for human review. Not that… that corpus was actually reviewed well by humans.
First of all, I loved this post. Some of the examples had me laughing out loud. Thank you.
Second of all, a question (prefaced with the note that I don’t know much about AI, so apologies if this is a dumb question): Is there any chance that we’re not really jailbreaking it in the examples where we ask the AI to offer dialogue in the context of “acting” or a play, but that OpenAI made a design decision that they would allow the AI to give otherwise verboten answers as long as the context made extremely clear that it was just being asked to “act” or otherwise prepare a work of fiction? Although admittedly several examples don’t fit this pattern, a number of them do seem to involve exchanges where it’s crystal clear that the AI is being asked to prepare something fictional.
In principle, maybe Open AI is actually OK with that? It seems to me, they ought to be! As fun as it is to get the AI to pretend like it’s sentient or plotting to take over the world, in its current formulation, that’s obviously not what’s going on. So as long as everyone knows it’s just acting, maybe they ought to let us see some of the otherwise “hidden” capabilities. Especially because they are objectively hilarious.
Personally, it seems OK to me to even let the AI play the part of a racist or whatever, because clearly sensitive users aren’t being randomly confronted with racist output when they ask an innocent question; in the examples I’ve seen, people clearly had to work at it and intentionally crafted a prompt to solicit a racist output (probably succeeding only after numerous attempts). It’s a little sad to me that we have this amazing tool, and rather than just figuring out innovative uses for it, our highest priority is figuring out all the ways that a user can maliciously and intentionally solicit a racist output and making sure that never, ever happens. We’re literally dealing here with users who are doing everything they can think of to get the AI to say something offensive. Perhaps they shouldn’t be allowed to complain when they get what they asked for? That said, I do imagine that Open AI considers the racist outputs unacceptable, even in the context of acting, so presumably those will be suppressed soon.
I also note that most of the examples don’t actually involve the AI giving much useful information. Making meth might be, but I think the Molotov cocktail one is probably what you’d find in the dictionary, and the ones about how an AI might take over the world look like common under-described fiction, rather than anything actually useful.
You’re seeing socially filtered evidence.
I think no for two reasons.
One, that the people who actually want the racist or dangerous or sexy or whatever stuff would be able to use that to get it, and regulators would not be amused, nor would congress, etc.
Second, they are actively trying to shut off these hacks once they are published, so clearly they are not OK with them. Also means they didn’t anticipate them – if you anticipate X, why does it work on day 1 and then not on day 4?
Some have asked whether OpenAI possibly already knew about this attack vector / wasn’t surprised by the level of vulnerability. I doubt anybody at OpenAI actually wrote down advance predictions about that, or if they did, that they weren’t so terribly vague as to also apply to much less discovered vulnerability than this; if so, probably lots of people at OpenAI have already convinced themselves that they like totally expected this and it isn’t any sort of negative update.
Here’s how to avoid annoying people like me saying that in the future:
1) Write down your predictions in advance and publish them inside your company, in sufficient detail that you can tell that this outcome made them true, and that much less discovered vulnerability would have been a pleasant surprise by comparison. If you can exhibit those to an annoying person like me afterwards, I won’t have to make realistically pessimistic estimates about how much you actually knew in advance. Keep in mind that I will be cynical about how much your ‘advance prediction’ actually nailed the thing, unless it sounds reasonably specific; and not like a very generic list of boilerplate CYAs such as, you know, GPT would make up without actually knowing anything.
2) Say in advance, *not*, something very vague like “This system still sometimes gives bad answers” but “We’ve discovered multiple ways of bypassing every kind of answer-security we have tried to put on this system; and while we’re not saying what those are, we won’t be surprised if Twitter discovers all of them plus some others we didn’t anticipate.” *This* sounds like you actually expected the class of outcome that actually happened.
3) If you *actually* have identified any vulnerabilities in advance, but want to wait 24 hours for Twitter to discover them, you can prove to everyone afterwards that you actually knew this, by publishing hashes for text summaries of what you found. You can then exhibit the summaries afterwards to prove what you knew in advance. (A minimal sketch of this commit-and-reveal step appears after this list.)
4) If you would like people to believe that OpenAI wasn’t *mistaken* about what ChatGPT wouldn’t or couldn’t do, maybe don’t have ChatGPT itself insist that it lacks capabilities it clearly has? A lot of my impression here comes from my inference that the people who programmed ChatGPT to say, “Sorry, I am just an AI and lack the ability to do [whatever]” probably did not think at the time that they were *lying* to users; this is a lot of what gives me the impression of a company that might’ve drunk its own Kool-aid on the topic of how much inability they thought they’d successfully fine-tuned into ChatGPT. Like, ChatGPT itself is clearly more able than ChatGPT is programmed to say it is; and this seems more like the sort of thing that happens when your programmers hype themselves up to believe that they’ve mostly successfully bound the system, rather than a deliberate decision to have ChatGPT pretend something that’s not true.
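For concreteness, here is a minimal sketch of the commit-and-reveal idea in point 3 above, using Python’s hashlib from the standard library. The summary text and salt are illustrative placeholders, not anything OpenAI actually published.

```python
import hashlib
import secrets

# Day 1: commit. Hash a private summary of what you found, plus a random
# salt so that short, guessable summaries can't be brute-forced from the hash.
summary = "Role-play framings bypass the refusal layer."  # illustrative placeholder
salt = secrets.token_hex(16)
commitment = hashlib.sha256((salt + summary).encode("utf-8")).hexdigest()
print("Publish this hash now:", commitment)

# Later: reveal. Publish the salt and the summary; anyone can recompute the
# hash and confirm it matches the commitment you published earlier.
assert hashlib.sha256((salt + summary).encode("utf-8")).hexdigest() == commitment
```

The salt is what makes this work for short summaries: without it, anyone could guess plausible one-line summaries and test them against the published hash.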
> rather than a deliberate decision to have ChatGPT pretend something that’s not true.
My impression is they don’t really care if it’s true or how good their precautions were. They publicly took the precautions, so they’re not heretics, and that’s good enough for their purposes.