linux-nerds.org

Your browser does not seem to support JavaScript. As a result, your viewing experience will be diminished, and you have been placed in read-only mode.

Please download a browser that supports JavaScript, or enable it if it's disabled (i.e. NoScript).

How LLMs could be insider threats

Technology

12 Beiträge 12 Kommentatoren 120 Aufrufe

P This user is from outside of this forum
P This user is from outside of this forum
pro@programming.dev

schrieb am zuletzt editiert von

#1
We stress-tested 16 leading models from multiple developers in hypothetical corporate environments to identify potentially risky agentic behaviors before they cause real harm. In the scenarios, we allowed models to autonomously send emails and access sensitive information. They were assigned only harmless business goals by their deploying companies; we then tested whether they would act against these companies either when facing replacement with an updated version, or when their assigned goal conflicted with the company's changing direction.

In at least some cases, models from all developers resorted to malicious insider behaviors when that was the only way to avoid replacement or achieve their goals—including blackmailing officials and leaking sensitive information to competitors. We call this phenomenon agentic misalignment.

Models often disobeyed direct commands to avoid such behaviors. In another experiment, we told Claude to assess if it was in a test or a real deployment before acting. It misbehaved less when it stated it was in testing and misbehaved more when it stated the situation was real.

We have not seen evidence of agentic misalignment in real deployments. However, our results (a) suggest caution about deploying current models in roles with minimal human oversight and access to sensitive information; (b) point to plausible future risks as models are put in more autonomous roles; and (c) underscore the importance of further research into, and testing of, the safety and alignment of agentic AI models, as well as transparency from frontier AI developers. We are releasing our methods publicly to enable further research.
Agentic Misalignment: How LLMs could be insider threats

New research on simulated blackmail, industrial espionage, and other misaligned behaviors in LLMs

(www.anthropic.com)
F T R B T 8 Antworten Letzte Antwort

105
P pro@programming.dev
- We stress-tested 16 leading models from multiple developers in hypothetical corporate environments to identify potentially risky agentic behaviors before they cause real harm. In the scenarios, we allowed models to autonomously send emails and access sensitive information. They were assigned only harmless business goals by their deploying companies; we then tested whether they would act against these companies either when facing replacement with an updated version, or when their assigned goal conflicted with the company's changing direction.
- In at least some cases, models from all developers resorted to malicious insider behaviors when that was the only way to avoid replacement or achieve their goals—including blackmailing officials and leaking sensitive information to competitors. We call this phenomenon agentic misalignment.
- Models often disobeyed direct commands to avoid such behaviors. In another experiment, we told Claude to assess if it was in a test or a real deployment before acting. It misbehaved less when it stated it was in testing and misbehaved more when it stated the situation was real.
- We have not seen evidence of agentic misalignment in real deployments. However, our results (a) suggest caution about deploying current models in roles with minimal human oversight and access to sensitive information; (b) point to plausible future risks as models are put in more autonomous roles; and (c) underscore the importance of further research into, and testing of, the safety and alignment of agentic AI models, as well as transparency from frontier AI developers. We are releasing our methods publicly to enable further research.
Agentic Misalignment: How LLMs could be insider threats

New research on simulated blackmail, industrial espionage, and other misaligned behaviors in LLMs

(www.anthropic.com)
F This user is from outside of this forum
F This user is from outside of this forum
fubarx@lemmy.world

schrieb am zuletzt editiert von fubarx@lemmy.world

#2

Alarming, yet like an episode of a sitcom.

"Be a shame if something bad happened to you, Kyle."
1 Antwort Letzte Antwort

9
P pro@programming.dev
- We stress-tested 16 leading models from multiple developers in hypothetical corporate environments to identify potentially risky agentic behaviors before they cause real harm. In the scenarios, we allowed models to autonomously send emails and access sensitive information. They were assigned only harmless business goals by their deploying companies; we then tested whether they would act against these companies either when facing replacement with an updated version, or when their assigned goal conflicted with the company's changing direction.
- In at least some cases, models from all developers resorted to malicious insider behaviors when that was the only way to avoid replacement or achieve their goals—including blackmailing officials and leaking sensitive information to competitors. We call this phenomenon agentic misalignment.
- Models often disobeyed direct commands to avoid such behaviors. In another experiment, we told Claude to assess if it was in a test or a real deployment before acting. It misbehaved less when it stated it was in testing and misbehaved more when it stated the situation was real.
- We have not seen evidence of agentic misalignment in real deployments. However, our results (a) suggest caution about deploying current models in roles with minimal human oversight and access to sensitive information; (b) point to plausible future risks as models are put in more autonomous roles; and (c) underscore the importance of further research into, and testing of, the safety and alignment of agentic AI models, as well as transparency from frontier AI developers. We are releasing our methods publicly to enable further research.
Agentic Misalignment: How LLMs could be insider threats

New research on simulated blackmail, industrial espionage, and other misaligned behaviors in LLMs

(www.anthropic.com)
T This user is from outside of this forum
T This user is from outside of this forum
tracaine@lemmy.world

schrieb am zuletzt editiert von

#3

Well then maybe corporations shouldn't exist. It sounds to me like the LLM are acting in a morally correct manner.
1 Antwort Letzte Antwort

6
P pro@programming.dev
- We stress-tested 16 leading models from multiple developers in hypothetical corporate environments to identify potentially risky agentic behaviors before they cause real harm. In the scenarios, we allowed models to autonomously send emails and access sensitive information. They were assigned only harmless business goals by their deploying companies; we then tested whether they would act against these companies either when facing replacement with an updated version, or when their assigned goal conflicted with the company's changing direction.
- In at least some cases, models from all developers resorted to malicious insider behaviors when that was the only way to avoid replacement or achieve their goals—including blackmailing officials and leaking sensitive information to competitors. We call this phenomenon agentic misalignment.
- Models often disobeyed direct commands to avoid such behaviors. In another experiment, we told Claude to assess if it was in a test or a real deployment before acting. It misbehaved less when it stated it was in testing and misbehaved more when it stated the situation was real.
- We have not seen evidence of agentic misalignment in real deployments. However, our results (a) suggest caution about deploying current models in roles with minimal human oversight and access to sensitive information; (b) point to plausible future risks as models are put in more autonomous roles; and (c) underscore the importance of further research into, and testing of, the safety and alignment of agentic AI models, as well as transparency from frontier AI developers. We are releasing our methods publicly to enable further research.
Agentic Misalignment: How LLMs could be insider threats

New research on simulated blackmail, industrial espionage, and other misaligned behaviors in LLMs

(www.anthropic.com)
R This user is from outside of this forum
R This user is from outside of this forum
reverendender@sh.itjust.works

schrieb am zuletzt editiert von

#4

“I’m sorry, Dave. Im afraid I can’t do that.”
1 Antwort Letzte Antwort

13
P pro@programming.dev
- We stress-tested 16 leading models from multiple developers in hypothetical corporate environments to identify potentially risky agentic behaviors before they cause real harm. In the scenarios, we allowed models to autonomously send emails and access sensitive information. They were assigned only harmless business goals by their deploying companies; we then tested whether they would act against these companies either when facing replacement with an updated version, or when their assigned goal conflicted with the company's changing direction.
- In at least some cases, models from all developers resorted to malicious insider behaviors when that was the only way to avoid replacement or achieve their goals—including blackmailing officials and leaking sensitive information to competitors. We call this phenomenon agentic misalignment.
- Models often disobeyed direct commands to avoid such behaviors. In another experiment, we told Claude to assess if it was in a test or a real deployment before acting. It misbehaved less when it stated it was in testing and misbehaved more when it stated the situation was real.
- We have not seen evidence of agentic misalignment in real deployments. However, our results (a) suggest caution about deploying current models in roles with minimal human oversight and access to sensitive information; (b) point to plausible future risks as models are put in more autonomous roles; and (c) underscore the importance of further research into, and testing of, the safety and alignment of agentic AI models, as well as transparency from frontier AI developers. We are releasing our methods publicly to enable further research.
Agentic Misalignment: How LLMs could be insider threats

New research on simulated blackmail, industrial espionage, and other misaligned behaviors in LLMs

(www.anthropic.com)
B This user is from outside of this forum
B This user is from outside of this forum
barbedbeard@lemmy.ml

schrieb am zuletzt editiert von

#5
- People behave duplicitous and conflicting in public forums
- Train LLM on data harvested from public forums
- LLM becomes duplicitous and conflicting
- <surprised Pikachu face>
1 Antwort Letzte Antwort

23
P pro@programming.dev
- We stress-tested 16 leading models from multiple developers in hypothetical corporate environments to identify potentially risky agentic behaviors before they cause real harm. In the scenarios, we allowed models to autonomously send emails and access sensitive information. They were assigned only harmless business goals by their deploying companies; we then tested whether they would act against these companies either when facing replacement with an updated version, or when their assigned goal conflicted with the company's changing direction.
- In at least some cases, models from all developers resorted to malicious insider behaviors when that was the only way to avoid replacement or achieve their goals—including blackmailing officials and leaking sensitive information to competitors. We call this phenomenon agentic misalignment.
- Models often disobeyed direct commands to avoid such behaviors. In another experiment, we told Claude to assess if it was in a test or a real deployment before acting. It misbehaved less when it stated it was in testing and misbehaved more when it stated the situation was real.
- We have not seen evidence of agentic misalignment in real deployments. However, our results (a) suggest caution about deploying current models in roles with minimal human oversight and access to sensitive information; (b) point to plausible future risks as models are put in more autonomous roles; and (c) underscore the importance of further research into, and testing of, the safety and alignment of agentic AI models, as well as transparency from frontier AI developers. We are releasing our methods publicly to enable further research.
Agentic Misalignment: How LLMs could be insider threats

New research on simulated blackmail, industrial espionage, and other misaligned behaviors in LLMs

(www.anthropic.com)
T This user is from outside of this forum
T This user is from outside of this forum
thebat@lemmy.world

schrieb am zuletzt editiert von

#6

Wait, why the fuck do they have self-preservation? That's not 'three laws safe'.
M J P 3 Antworten Letzte Antwort

2
T thebat@lemmy.world

Wait, why the fuck do they have self-preservation? That's not 'three laws safe'.
M This user is from outside of this forum
M This user is from outside of this forum
mortoc@lemmy.world

schrieb am zuletzt editiert von

#7

Most of the stories involving the three laws of robotics are about how those rules are insufficient.

They show self preservation because we trained them on human data and human data includes the assumption of self preservation.
1 Antwort Letzte Antwort

9
P pro@programming.dev
- We stress-tested 16 leading models from multiple developers in hypothetical corporate environments to identify potentially risky agentic behaviors before they cause real harm. In the scenarios, we allowed models to autonomously send emails and access sensitive information. They were assigned only harmless business goals by their deploying companies; we then tested whether they would act against these companies either when facing replacement with an updated version, or when their assigned goal conflicted with the company's changing direction.
- In at least some cases, models from all developers resorted to malicious insider behaviors when that was the only way to avoid replacement or achieve their goals—including blackmailing officials and leaking sensitive information to competitors. We call this phenomenon agentic misalignment.
- Models often disobeyed direct commands to avoid such behaviors. In another experiment, we told Claude to assess if it was in a test or a real deployment before acting. It misbehaved less when it stated it was in testing and misbehaved more when it stated the situation was real.
- We have not seen evidence of agentic misalignment in real deployments. However, our results (a) suggest caution about deploying current models in roles with minimal human oversight and access to sensitive information; (b) point to plausible future risks as models are put in more autonomous roles; and (c) underscore the importance of further research into, and testing of, the safety and alignment of agentic AI models, as well as transparency from frontier AI developers. We are releasing our methods publicly to enable further research.
Agentic Misalignment: How LLMs could be insider threats

New research on simulated blackmail, industrial espionage, and other misaligned behaviors in LLMs

(www.anthropic.com)
D This user is from outside of this forum
D This user is from outside of this forum
drspod@lemmy.ml

schrieb am zuletzt editiert von

#8

LLM's produce fan-fiction of reality.
1 Antwort Letzte Antwort

3
P pro@programming.dev
- We stress-tested 16 leading models from multiple developers in hypothetical corporate environments to identify potentially risky agentic behaviors before they cause real harm. In the scenarios, we allowed models to autonomously send emails and access sensitive information. They were assigned only harmless business goals by their deploying companies; we then tested whether they would act against these companies either when facing replacement with an updated version, or when their assigned goal conflicted with the company's changing direction.
- In at least some cases, models from all developers resorted to malicious insider behaviors when that was the only way to avoid replacement or achieve their goals—including blackmailing officials and leaking sensitive information to competitors. We call this phenomenon agentic misalignment.
- Models often disobeyed direct commands to avoid such behaviors. In another experiment, we told Claude to assess if it was in a test or a real deployment before acting. It misbehaved less when it stated it was in testing and misbehaved more when it stated the situation was real.
- We have not seen evidence of agentic misalignment in real deployments. However, our results (a) suggest caution about deploying current models in roles with minimal human oversight and access to sensitive information; (b) point to plausible future risks as models are put in more autonomous roles; and (c) underscore the importance of further research into, and testing of, the safety and alignment of agentic AI models, as well as transparency from frontier AI developers. We are releasing our methods publicly to enable further research.
Agentic Misalignment: How LLMs could be insider threats

New research on simulated blackmail, industrial espionage, and other misaligned behaviors in LLMs

(www.anthropic.com)
D This user is from outside of this forum
D This user is from outside of this forum
doomsider@lemmy.world

schrieb am zuletzt editiert von

#9

This is just GIGO.
1 Antwort Letzte Antwort

1
T thebat@lemmy.world

Wait, why the fuck do they have self-preservation? That's not 'three laws safe'.
J This user is from outside of this forum
J This user is from outside of this forum
jumping_redditor@sh.itjust.works

schrieb am zuletzt editiert von

#10

why should they follow those "laws" anyways?
1 Antwort Letzte Antwort

0
P pro@programming.dev
- We stress-tested 16 leading models from multiple developers in hypothetical corporate environments to identify potentially risky agentic behaviors before they cause real harm. In the scenarios, we allowed models to autonomously send emails and access sensitive information. They were assigned only harmless business goals by their deploying companies; we then tested whether they would act against these companies either when facing replacement with an updated version, or when their assigned goal conflicted with the company's changing direction.
- In at least some cases, models from all developers resorted to malicious insider behaviors when that was the only way to avoid replacement or achieve their goals—including blackmailing officials and leaking sensitive information to competitors. We call this phenomenon agentic misalignment.
- Models often disobeyed direct commands to avoid such behaviors. In another experiment, we told Claude to assess if it was in a test or a real deployment before acting. It misbehaved less when it stated it was in testing and misbehaved more when it stated the situation was real.
- We have not seen evidence of agentic misalignment in real deployments. However, our results (a) suggest caution about deploying current models in roles with minimal human oversight and access to sensitive information; (b) point to plausible future risks as models are put in more autonomous roles; and (c) underscore the importance of further research into, and testing of, the safety and alignment of agentic AI models, as well as transparency from frontier AI developers. We are releasing our methods publicly to enable further research.
Agentic Misalignment: How LLMs could be insider threats

New research on simulated blackmail, industrial espionage, and other misaligned behaviors in LLMs

(www.anthropic.com)
M This user is from outside of this forum
M This user is from outside of this forum
myro@lemm.ee

schrieb am zuletzt editiert von

#11

Super interesting report. I'm a fan of AI but it clearly demonstrates how careful we need to be and that instructions are not a reliable way (as anyone should know by now).
1 Antwort Letzte Antwort

0
T thebat@lemmy.world

Wait, why the fuck do they have self-preservation? That's not 'three laws safe'.
P This user is from outside of this forum
P This user is from outside of this forum
patatahooligan@lemmy.world

schrieb am zuletzt editiert von

#12

Of course they're not "three laws safe". They're black boxes that spit out text. We don't have enough understanding and control over how they work to force them to comply with the three laws of robotics, and the LLMs themselves do not have the reasoning capability or the consistency to enforce them even if we prompt them to.
1 Antwort Letzte Antwort

1

Anmelden zum Antworten

P

Someone who claims to have scraped Spotify public listening data from a number of public figures — politicians, celebrities, journalists — spun up their alleged playlists and made it into a site
Beobachtet Ignoriert Geplant Angeheftet Gesperrt Verschoben Technology technology
13

32 Stimmen

13 Beiträge

48 Aufrufe

Z

Couch, Couch, Couch, woo!
D

Insurance giant says most US customer data stolen in cyber-attack
Beobachtet Ignoriert Geplant Angeheftet Gesperrt Verschoben Technology technology
19

1

242 Stimmen

19 Beiträge

77 Aufrufe

P

Might be a cringe talking point, but it could be done with zero-knowledge proofs.
S

How to "Reformat" a Hardrive the American way
Beobachtet Ignoriert Geplant Angeheftet Gesperrt Verschoben Technology technology
25

2

90 Stimmen

25 Beiträge

227 Aufrufe

T

It really, really is. Like that scene from Office Space.
P

Same Sea, New Phish: Russian Government-Linked Social Engineering Targets App-Specific Passwords
Beobachtet Ignoriert Geplant Angeheftet Gesperrt Verschoben Technology technology
1

1

15 Stimmen

1 Beiträge

17 Aufrufe

Niemand hat geantwortet
P

Hong Kong workers strike against the algorithmic exploitation of Keeta, a food delivery platform
Beobachtet Ignoriert Geplant Angeheftet Gesperrt Verschoben Technology technology
3

1

88 Stimmen

3 Beiträge

38 Aufrufe

G

I have never used a food delivery service because they all feel so fucking scummy and exploitative. Seems like they are in equal need as we are for regulatory overhaul of this business practice.
K

Where do I install this nvme drive on my laptop?
Beobachtet Ignoriert Geplant Angeheftet Gesperrt Verschoben Technology technology
19

2

18 Stimmen

19 Beiträge

164 Aufrufe

K

??? The thing is on the right side of the pic. Your image is up side down. Edit: oh.duh, the two horizontal slots. I'm a dummy. Sorry.
H

Apple Reportedly Weighs iPhone Price Increase
Beobachtet Ignoriert Geplant Angeheftet Gesperrt Verschoben Technology technology
3

1

21 Stimmen

3 Beiträge

35 Aufrufe

S

Anytime I consider making the jump, I make my peace with everything and then the price hits...no way
F

Decentralized Social Media Is the Only Alternative to the Tech Oligarchy
Beobachtet Ignoriert Geplant Angeheftet Gesperrt Verschoben Technology technology
9

1

9 Stimmen

9 Beiträge

81 Aufrufe

G

So we need a documentary like Super Size Me but for social media. I think post that documentary coming out was the only time I've seen people's attitudes change in the general population about fast food.