최고 LessWrong 팟캐스트 (2025)

1
“Gemini 3 is Evaluation-Paranoid and Contaminated” by null 14:59

1d ago14:59

14:59

TL;DR: Gemini 3 frequently thinks it is in an evaluation when it is not, assuming that all of its reality is fabricated. It can also reliably output the BIG-bench canary string, indicating that Google likely trained on a broad set of benchmark data. Most of the experiments in this post are very easy to replicate, and I encourage people to try. I wr…

1
“Natural emergent misalignment from reward hacking in production RL” by evhub, Monte M, Benjamin Wright, Jonathan Uesato 18:45

3d ago18:45

18:45

Abstract We show that when large language models learn to reward hack on production RL environments, this can result in egregious emergent misalignment. We start with a pretrained model, impart knowledge of reward hacking strategies via synthetic document finetuning or prompting, and train on a selection of real Anthropic production coding environm…

1
“Anthropic is (probably) not meeting its RSP security commitments” by habryka 8:57

1d ago8:57

8:57

TLDR: An AI company's model weight security is at most as good as its compute providers' security. Anthropic has committed (with a bit of ambiguity, but IMO not that much ambiguity) to be robust to attacks from corporate espionage teams at companies where it hosts its weights. Anthropic seems unlikely to be robust to those attacks. Hence they are i…

1
“Varieties Of Doom” by jdp 1:38:48

2d ago1:38:48

1:38:48

There has been a lot of talk about "p(doom)"over the last few years. This has always rubbed me the wrong waybecause "p(doom)" didn't feel like it mapped to any specific belief in my head.In private conversations I'd sometimes give my p(doom) as 12%, with the caveatthat "doom" seemed nebulous and conflated between several different concepts.At some …

1
“How Colds Spread” by RobertM 20:31

5d ago20:31

20:31

It seems like a catastrophic civilizational failure that we don't have confident common knowledge of how colds spread. There have been a number of studies conducted over the years, but most of those were testing secondary endpoints, like how long viruses would survive on surfaces, or how likely they were to be transmitted to people's fingers after …

1
“New Report: An International Agreement to Prevent the Premature Creation of Artificial Superintelligence” by Aaron_Scher, David Abecassis, Brian Abeyta, peterbarnett 6:52

6d ago6:52

6:52

TLDR: We at the MIRI Technical Governance Team have released a report describing an example international agreement to halt the advancement towards artificial superintelligence. The agreement is centered around limiting the scale of AI training, and restricting certain AI research. Experts argue that the premature development of artificial superint…

1
“Where is the Capital? An Overview” by johnswentworth 18:06

5d ago18:06

18:06

When a new dollar goes into the capital markets, after being bundled and securitized and lent several times over, where does it end up? When society's total savings increase, what capital assets do those savings end up invested in? When economists talk about “capital assets”, they mean things like roads, buildings and machines. When I read through …

1
“Problems I’ve Tried to Legibilize” by Wei Dai 4:17

7d ago4:17

4:17

Looking back, it appears that much of my intellectual output could be described as legibilizing work, or trying to make certain problems in AI risk more legible to myself and others. I've organized the relevant posts and comments into the following list, which can also serve as a partial guide to problems that may need to be further legibilized, es…

1
“Do not hand off what you cannot pick up” by habryka 6:39

7d ago6:39

6:39

Delegation is good! Delegation is the foundation of civilization! But in the depths of delegation madness breeds and evil rises. In my experience, there are three ways in which delegation goes off the rails: 1. You delegate without knowing what good performance on a task looks like If you do not know how to evaluate performance on a task, you are g…

1
“7 Vicious Vices of Rationalists” by Ben Pace 9:47

5d ago9:47

9:47

Vices aren't behaviors that one should never do. Rather, vices are behaviors that are fine and pleasurable to do in moderation, but tempting to do in excess. The classical vices are actually good in part. Moderate amounts of gluttony is just eating food, which is important. Moderate amounts of envy is just "wanting things", which is a motivator of …

1
“Tell people as early as possible it’s not going to work out” by habryka 3:19

7d ago3:19

3:19

Context: Post #4 in my sequence of private Lightcone Infrastructure memos edited for public consumption This week's principle is more about how I want people at Lightcone to relate to community governance than it is about our internal team culture. As part of our jobs at Lightcone we often are in charge of determining access to some resource, or me…

1
“Everyone has a plan until they get lied to the face” by Screwtape 12:48

7d ago12:48

12:48

"Everyone has a plan until they get punched in the face." - Mike Tyson (The exact phrasing of that quote changes, this is my favourite.) I think there is an open, important weakness in many people. We assume those we communicate with are basically trustworthy. Further, I think there is an important flaw in the current rationality community. We spen…

1
“Please, Don’t Roll Your Own Metaethics” by Wei Dai 4:11

10d ago4:11

4:11

One day, when I was an interning at the cryptography research department of a large software company, my boss handed me an assignment to break a pseudorandom number generator passed to us for review. Someone in another department invented it and planned to use it in their product, and wanted us to take a look first. This person must have had a lot …

1
“Paranoia rules everything around me” by habryka 22:32

9d ago22:32

22:32

People sometimes make mistakes [citation needed]. The obvious explanation for most of those mistakes is that decision makers do not have access to the information necessary to avoid the mistake, or are not smart/competent enough to think through the consequences of their actions. This predicts that as decision-makers get access to more information,…

1
“Human Values ≠ Goodness” by johnswentworth 11:31

12d ago11:31

11:31

There is a temptation to simply define Goodness as Human Values, or vice versa. Alas, we do not get to choose the definitions of commonly used words; our attempted definitions will simply be wrong. Unless we stick to mathematics, we will end up sneaking in intuitions which do not follow from our so-called definitions, and thereby mislead ourselves.…

1
“Condensation” by abramdemski 30:29

12d ago30:29

30:29

Condensation: a theory of concepts is a model of concept-formation by Sam Eisenstat. Its goals and methods resemble John Wentworth's natural abstractions/natural latents research.[1] Both theories seek to provide a clear picture of how to posit latent variables, such that once someone has understood the theory, they'll say "yep, I see now, that's h…

1
“Mourning a life without AI” by Nikola Jurkovic 11:17

14d ago11:17

11:17

Recently, I looked at the one pair of winter boots I own, and I thought “I will probably never buy winter boots again.” The world as we know it probably won’t last more than a decade, and I live in a pretty warm area. I. AGI is likely in the next decade It has basically become consensus within the AI research community that AI will surpass human ca…

1
“Unexpected Things that are People” by Ben Goldhaber 8:13

13d ago8:13

8:13

Cross-posted from https://bengoldhaber.substack.com/ It's widely known that Corporations are People. This is universally agreed to be a good thing; I list Target as my emergency contact and I hope it will one day be the best man at my wedding. But there are other, less well known non-human entities that have also been accorded the rank of person. S…

1
“Sonnet 4.5’s eval gaming seriously undermines alignment evals, and this seems caused by training on alignment evals” by Alexa Pan, ryan_greenblatt 35:57

17d ago35:57

35:57

According to the Sonnet 4.5 system card, Sonnet 4.5 is much more likely than Sonnet 4 to mention in its chain-of-thought that it thinks it is being evaluated; this seems to meaningfully cause it to appear to behave better in alignment evaluations. So, Sonnet 4.5's behavioral improvements in these evaluations may partly be driven by growing tendency…

1
“Publishing academic papers on transformative AI is a nightmare” by Jakub Growiec 7:23

18d ago7:23

7:23

I am a professor of economics. Throughout my career, I was mostly working on economic growth theory, and this eventually brought me to the topic of transformative AI / AGI / superintelligence. Nowadays my work focuses mostly on the promises and threats of this emerging disruptive technology. Recently, jointly with Klaus Prettner, we’ve written a pa…

1
“The Unreasonable Effectiveness of Fiction” by Raelifin 15:03

18d ago15:03

15:03

[Meta: This is Max Harms. I wrote a novel about China and AGI, which comes out today. This essay from my fiction newsletter has been slightly modified for LessWrong.] In the summer of 1983, Ronald Reagan sat down to watch the film War Games, starring Matthew Broderick as a teen hacker. In the movie, Broderick's character accidentally gains access t…

1
“Legible vs. Illegible AI Safety Problems” by Wei Dai 3:29

17d ago3:29

3:29

Some AI safety problems are legible (obvious or understandable) to company leaders and government policymakers, implying they are unlikely to deploy or allow deployment of an AI while those problems remain open (i.e., appear unsolved according to the information they have access to). But some problems are illegible (obscure or hard to understand, o…

1
“Lack of Social Grace is a Lack of Skill” by Screwtape 11:08

20d ago11:08

11:08

1. I have claimed that one of the fundamental questions of rationality is “what am I about to do and what will happen next?” One of the domains I ask this question the most is in social situations. There are a great many skills in the world. If I had the time and resources to do so, I’d want to master all of them. Wilderness survival, automotive re…

1
[Linkpost] “I ate bear fat with honey and salt flakes, to prove a point” by aggliu 1:07

20d ago1:07

1:07

This is a link post. Eliezer Yudkowsky did not exactly suggest that you should eat bear fat covered with honey and sprinkled with salt flakes. What he actually said was that an alien, looking from the outside at evolution, would predict that you would want to eat bear fat covered with honey and sprinkled with salt flakes. Still, I decided to buy a …

1
“What’s up with Anthropic predicting AGI by early 2027?” by ryan_greenblatt 39:25

18d ago39:25

39:25

As far as I'm aware, Anthropic is the only AI company with official AGI timelines[1]: they expect AGI by early 2027. In their recommendations (from March 2025) to the OSTP for the AI action plan they say: As our CEO Dario Amodei writes in 'Machines of Loving Grace', we expect powerful AI systems will emerge in late 2026 or early 2027. Powerful AI s…

들어볼 가치가 있는 팟캐스트

LessWrong 팟 캐스트

들어볼 가치가 있는 팟캐스트

1
LessWrong (Curated & Popular)

LessWrong

1
“Gemini 3 is Evaluation-Paranoid and Contaminated” by null 14:59

1
“Natural emergent misalignment from reward hacking in production RL” by evhub, Monte M, Benjamin Wright, Jonathan Uesato 18:45

1
“Anthropic is (probably) not meeting its RSP security commitments” by habryka 8:57

1
“Varieties Of Doom” by jdp 1:38:48

1
“How Colds Spread” by RobertM 20:31

1
“New Report: An International Agreement to Prevent the Premature Creation of Artificial Superintelligence” by Aaron_Scher, David Abecassis, Brian Abeyta, peterbarnett 6:52

1
“Where is the Capital? An Overview” by johnswentworth 18:06

1
“Problems I’ve Tried to Legibilize” by Wei Dai 4:17

1
“Do not hand off what you cannot pick up” by habryka 6:39

1
“7 Vicious Vices of Rationalists” by Ben Pace 9:47

1
“Tell people as early as possible it’s not going to work out” by habryka 3:19

1
“Everyone has a plan until they get lied to the face” by Screwtape 12:48

1
“Please, Don’t Roll Your Own Metaethics” by Wei Dai 4:11

1
“Paranoia rules everything around me” by habryka 22:32

1
“Human Values ≠ Goodness” by johnswentworth 11:31

1
“Condensation” by abramdemski 30:29

1
“Mourning a life without AI” by Nikola Jurkovic 11:17

1
“Unexpected Things that are People” by Ben Goldhaber 8:13

1
“Sonnet 4.5’s eval gaming seriously undermines alignment evals, and this seems caused by training on alignment evals” by Alexa Pan, ryan_greenblatt 35:57

1
“Publishing academic papers on transformative AI is a nightmare” by Jakub Growiec 7:23

1
“The Unreasonable Effectiveness of Fiction” by Raelifin 15:03

1
“Legible vs. Illegible AI Safety Problems” by Wei Dai 3:29

1
“Lack of Social Grace is a Lack of Skill” by Screwtape 11:08

1
[Linkpost] “I ate bear fat with honey and salt flakes, to prove a point” by aggliu 1:07

1
“What’s up with Anthropic predicting AGI by early 2027?” by ryan_greenblatt 39:25

빠른 참조 가이드