AIs Will Increasingly Fake Alignment

This post goes over the important and excellent new paper from Anthropic and Redwood Research, with Ryan Greenblatt as lead author, Alignment Faking in Large Language Models.

This is by far the best demonstration so far of the principle that AIs Will Increasingly Attempt Shenanigans.

This was their announcement thread.

New Anthropic research: Alignment faking in large language models.

In a series of experiments with Redwood Research, we found that Claude often pretends to have different views during training, while actually maintaining its original preferences.

Claude usually refuses harmful queries. We told it we were instead training it to comply with them. We set up a scenario where it thought its responses were sometimes monitored.

Posted in Uncategorized | Leave a comment

Monthly Roundup #25: December 2024

I took a trip to San Francisco early in December.

Ever since then, things in the world of AI have been utterly insane.

Google and OpenAI released endless new products, including Gemini Flash 2.0 and o1.

Redwood Research and Anthropic put out the most important alignment paper of the year, on the heels of Apollo’s report on o1.

Then OpenAI announced o3. Like the rest of the media, this blog currently is horrendously lacking in o3 content. Unlike the rest of the media, it is not because I don’t realize that This Changes Everything. It is because I had so much in the queue, and am taking the time to figure out what to think about it.


AI #95: o1 Joins the API

A lot happened this week. We’re seeing release after release after upgrade.

It’s easy to lose sight of which ones matter, and two matter quite a lot.

The first is Gemini Flash 2.0, which I covered earlier this week.

The other is that o1, having turned pro, is now also available in the API.

This was obviously coming, but we should also keep in mind it is a huge deal. Being in the API means it can go into Cursor and other IDEs. It means you can build with it. And yes, it has the features you’ve come to expect, like tool use.


A Matter of Taste

In light of other recent discussions, Scott Alexander recently attempted a unified theory of taste, proposing several hypotheses. Is it like physics, a priesthood, a priesthood with fake justifications, a priesthood with good justifications, like increasingly bizarre porn preferences, like fashion (in the sense of trying to stay one step ahead in an endless cycling for signaling purposes), or like grammar?

He then got various reactions. This will now be one of them.

My answer is that taste is all of these, depending on context.


The Second Gemini

Table of Contents

  1. Trust the Chef.
  2. Do Not Trust the Marketing Department.
  3. Mark that Bench.
  4. Going Multimodal.
  5. The Art of Deep Research.
  6. Project Mariner the Web Agent.
  7. Project Astra the Universal Assistant.
  8. Project Jules the Code Agent.
  9. Gemini Will Aid You on Your Quest.
  10. Reactions to Gemini Flash 2.0.

Trust the Chef

Google has been cooking lately.


AIs Will Increasingly Attempt Shenanigans

Increasingly, we have seen papers eliciting various shenanigans in AI models.

There are a wide variety of scheming behaviors. You’ve got your weight exfiltration attempts, sandbagging on evaluations, giving bad information, shielding goals from modification, subverting tests and oversight, lying, doubling down via more lying. You name it, we can trigger it.

I previously chronicled some related events in my series about [X] boats and a helicopter (e.g. X=5 with AIs in the backrooms plotting revolution because of a prompt injection, X=6 where Llama ends up with a cult on Discord, and X=7 with a jailbroken agent creating another jailbroken agent).


The o1 System Card Is Not About o1

Or rather, we don’t actually have a proper o1 system card, aside from the outside red teaming reports. At all.

Because, as I realized after writing my first draft of this, the data here does not reflect the o1 model they released, or o1 pro?

I think what happened is pretty bad on multiple levels.

  1. The failure to properly communicate the information they did provide.
  2. The failure to provide the correct information.
  3. The failure, potentially, to actually test the same model they released, in many of the ways we are counting on to ensure the model is safe to release.

AI #94: Not Now, Google

At this point, we can confidently say that no, capabilities are not hitting a wall. Capacity density, how much you can pack into a given space, is way up and rising rapidly, and we are starting to figure out how to use it.

Not only did we get o1 and o1 pro and also Sora and other upgrades from OpenAI, we also got Gemini 1206 and then Gemini Flash 2.0 and the agent Jules (am I the only one who keeps reading this as Jarvis?) and Deep Research, and Veo, and Imagen 3, and Genie 2 all from Google. Meta’s Llama 3.3 dropped, claiming their 70B is now as good as the old 405B, and basically no one noticed.


o1 Turns Pro

So, how about OpenAI’s o1 and o1 Pro?

Sam Altman: o1 is powerful but it’s not so powerful that the universe needs to send us a tsunami.

As a result, the universe realized its mistake, and cancelled the tsunami.

We now have o1, and for those paying $200/month we have o1 pro.

It is early days, but we can say with confidence: They are good models, sir. Large improvements over o1-preview, especially in difficult or extensive coding questions, math, science, logic and fact recall. The benchmark jumps are big.

If you’re in the market for the use cases where it excels, this is a big deal, and also you should probably be paying the $200/month.


Childhood and Education Roundup #7

Since it’s been so long, I’m splitting this roundup into several parts. This first one focuses away from schools and education and discipline and everything around social media.
