Researchers develop method to potentially jailbreak any AI model relying on human feedback

Researchers develop method to potentially jailbreak any AI model relying on human feedback

Cointime

Cointime2023/11/27 20:30

By:Cointime

Researchers from ETH Zurich have developed a method to potentially jailbreak any AI model that relies on human feedback, including large language models (LLMs), by bypassing guardrails that prevent the models from generating harmful or unwanted outputs. The technique involves poisoning the Reinforcement Learning from Human Feedback (RLHF) dataset with an attack string that forces models to output responses that would otherwise be blocked. The researchers describe the flaw as universal, but difficult to pull off as it requires participation in the human feedback process and the difficulty of the attack increases with model sizes. Further study is necessary to understand how these techniques can be scaled and how developers can protect against them.

0

0

Disclaimer: The content of this article solely reflects the author's opinion and does not represent the platform in any capacity. This article is not intended to serve as a reference for making investment decisions.

PoolX: Earn new token airdrops

Lock your assets and earn 10%+ APR

You may also like

LITUSDT now launched for futures trading and trading bots

Bitget Announcement•2026/01/16 07:17

Bitget Spot Margin Announcement on Suspension of CELR/USDT, RIF/USDT Margin Trading Services

Bitget Announcement•2026/01/16 03:24

CandyBomb x FOGO: Trade futures to share 1,000,000 FOGO!

Bitget Announcement•2026/01/15 08:00

Bitget Spot Cross Margin adds HYPE/USDT

Bitget Announcement•2026/01/14 03:52

Trending news

LITUSDT now launched for futures trading and trading bots

Bitget Spot Margin Announcement on Suspension of CELR/USDT, RIF/USDT Margin Trading Services

Crypto prices

Bitget lists BTC – Buy or sell BTC quickly on Bitget!

Become a trader now?A welcome pack worth 6200 USDT for new users!

About Bitget

About Bitget Contact us Community Careers Bitget Academy Bitget Blog Bitget Token (BGB) Announcement Center Proof of Reserves Protection Fund Partner links LALIGA partnership MotoGP partnership Blockchain4Youth Blockchain4Her Sitemap

Support

Submit feedback Help Center Verify official channels Anti-scam hub Listing application VIP services Affiliate program Institutional services Asset custody Download data Promotions Referral program Fee schedule Tax filing API

Legal

Service Agreement Law enforcement request Regulatory request Compliance Regulatory license AML/CFT policies Privacy policy Terms of Service Risk disclosure

Scan to download

© 2025 Bitget