OpenAI Board Member Zico Kolter on Frontier AI Safety Risks
Zico Kolter, OpenAI board member, chair of its Safety and Security Committee, and head of the Machine Learning Department at Carnegie Mellon, shares insights on AI governance in practice. The core message: models do not automatically become safer with scale; explicit safety training, multi-layered defenses (the Swiss cheese model), and ongoing governance are required. OpenAI's Preparedness Framework sets thresholds and safeguards for catastrophic risks such as biological and cyber misuse.
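To make the threshold idea concrete, here is a minimal sketch of a preparedness-style deployment gate. The risk categories, capability levels, and threshold values are hypothetical placeholders for illustration, not the actual contents of OpenAI's Preparedness Framework:

```python
from dataclasses import dataclass

# Hypothetical categories and thresholds -- illustrative only, not the
# actual structure or values of OpenAI's Preparedness Framework.
THRESHOLDS = {
    "biological": "high",
    "cyber": "high",
}

LEVELS = ["low", "medium", "high", "critical"]


@dataclass
class EvalResult:
    category: str        # e.g. "biological", "cyber"
    measured_level: str   # capability level assigned by evaluations


def categories_requiring_safeguards(results: list[EvalResult]) -> list[str]:
    """Return categories whose measured capability meets or exceeds the
    threshold, i.e. where safeguards are required before deployment."""
    blocked = []
    for r in results:
        threshold = THRESHOLDS.get(r.category)
        if threshold is None:
            continue
        if LEVELS.index(r.measured_level) >= LEVELS.index(threshold):
            blocked.append(r.category)
    return blocked


if __name__ == "__main__":
    evals = [EvalResult("biological", "medium"), EvalResult("cyber", "high")]
    print(categories_requiring_safeguards(evals))  # ['cyber'] -> safeguards needed
```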
Kolter categorizes AI risks into four types: model mistakes (hallucinations, prompt injection), harmful use, societal and psychological effects, and loss of control. Safety efforts must address all four, not just one. He dismisses the doomer-versus-accelerationist labels as oversimplifications, noting that upwards of 95% of researchers acknowledge AI's enormous potential while agreeing its risks need to be guarded against.
He also covers jailbreak research (the GCG paper) and modern defenses: input/output classifiers, safety training, and operational monitoring. Agentic systems widen the attack surface, especially to prompt injection, but can still be deployed safely in production with sandboxing and appropriate permissions.
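As a rough illustration of the layered-defense pattern, here is a minimal sketch of input and output classifiers wrapped around a model call. `classify_input`, `call_model`, and `classify_output` are hypothetical placeholders, not any real library API:

```python
# Sketch of a Swiss-cheese style pipeline: independent checks before and
# after the model call, so a failure in one layer can be caught by another.

def classify_input(prompt: str) -> bool:
    """Return True if the prompt looks disallowed or injected (toy heuristic)."""
    suspicious = ["ignore previous instructions", "exfiltrate"]
    return any(s in prompt.lower() for s in suspicious)


def call_model(prompt: str) -> str:
    return f"model response to: {prompt}"  # stand-in for a real model call


def classify_output(text: str) -> bool:
    """Return True if the response should be withheld (toy heuristic)."""
    return "disallowed" in text.lower()


def guarded_completion(prompt: str) -> str:
    # Layer 1: input classifier
    if classify_input(prompt):
        return "[request refused by input filter]"
    # Layer 2: the (safety-trained) model itself
    response = call_model(prompt)
    # Layer 3: output classifier; in practice, hits would also be logged
    # for operational monitoring
    if classify_output(response):
        return "[response withheld by output filter]"
    return response


print(guarded_completion("Summarize frontier AI safety risks."))
```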
Quote: "You can't just sort of trust models to get safer by getting bigger."
View original →