Anthropic Engineering: Quantifying Infrastructure Noise in Agentic Coding Evals
Resource configuration alone can swing Terminal-Bench 2.0 scores by up to 6 percentage points, more than many model-to-model gaps on leaderboards. Strict enforcement (1x resources) produces a high infrastructure error rate (5.8%), while uncapped resources lift success by 6pp by enabling heavyweight approaches. The team recommends specifying, for each task, both a guaranteed allocation and a separate hard-kill threshold, which neutralizes the noise without inflating scores. Key takeaway: treat leaderboard gaps below 3pp with skepticism until the eval infrastructure is fully documented. https://www.anthropic.com/engineering/infrastructure-noise
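The two-threshold recommendation above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `TaskResources` class, its field names, the memory-only view of resources, and the example numbers are hypothetical and are not taken from the Anthropic post or from Terminal-Bench.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResources:
    """Hypothetical per-task resource spec; field names are illustrative only."""
    guaranteed_mem_mb: int  # memory the task is always allotted
    hard_kill_mem_mb: int   # usage above this terminates the task

    def __post_init__(self):
        # A kill threshold below the guarantee would contradict the guarantee.
        if self.hard_kill_mem_mb < self.guaranteed_mem_mb:
            raise ValueError("hard-kill threshold must be >= guaranteed allocation")


def classify_usage(spec: TaskResources, peak_mem_mb: int) -> str:
    """Label a run by where its peak memory falls relative to the two thresholds."""
    if peak_mem_mb > spec.hard_kill_mem_mb:
        return "killed"
    if peak_mem_mb > spec.guaranteed_mem_mb:
        return "burst"  # above the guarantee but still tolerated
    return "within_guarantee"


def gap_is_within_noise(score_a_pp: float, score_b_pp: float, noise_pp: float = 3.0) -> bool:
    """True when two scores differ by less than the 3pp figure quoted in the summary."""
    return abs(score_a_pp - score_b_pp) < noise_pp


spec = TaskResources(guaranteed_mem_mb=2048, hard_kill_mem_mb=6144)
labels = [classify_usage(spec, peak) for peak in (1500, 3000, 7000)]
```

Keeping the guarantee and the kill threshold as separate fields lets a run exceed its baseline briefly without being terminated, while still bounding how much extra headroom any task can consume.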
Anthropic Engineering: Harness Design for Long-Running Application Development
A GAN-inspired three-agent harness (planner, generator, evaluator) with sprint contracts and Playwright QA enables Claude to autonomously build rich full-stack apps over multi-hour sessions. For a retro game maker, the full harness produced far better results than a single-agent baseline, despite higher cost. With Opus 4.6, the team simplified the harness by removing the sprint structure, showing that harnesses should evolve as models improve. The approach turns subjective quality into gradable criteria and keeps agents coherent across long tasks. https://www.anthropic.com/engineering/harness-design-long-running-apps
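The planner, generator, and evaluator roles described above can be sketched as a simple feedback loop. This is a toy illustration under stated assumptions, not the harness from the Anthropic post: `run_harness`, the stub agents, the round limit, and the substring-based grading are all hypothetical stand-ins for model calls and Playwright checks.

```python
from typing import Callable, List, Tuple

# Hypothetical role signatures; in a real harness each would wrap a model call.
Planner = Callable[[str], List[str]]               # brief -> gradable criteria
Generator = Callable[[str, List[str]], str]        # brief, feedback -> artifact
Evaluator = Callable[[str, List[str]], List[str]]  # artifact, criteria -> unmet criteria


def run_harness(brief: str, planner: Planner, generator: Generator,
                evaluator: Evaluator, max_rounds: int = 3) -> Tuple[str, List[str], int]:
    """Generate, grade against the planner's criteria, and retry with feedback."""
    criteria = planner(brief)
    feedback: List[str] = []
    artifact = ""
    rounds = 0
    for rounds in range(1, max_rounds + 1):
        artifact = generator(brief, feedback)
        feedback = evaluator(artifact, criteria)
        if not feedback:  # every criterion met, stop early
            break
    return artifact, feedback, rounds


# Stub agents for demonstration only.
def stub_planner(brief: str) -> List[str]:
    return ["title", "save button"]


def stub_generator(brief: str, feedback: List[str]) -> str:
    # Starts with a title and adds whatever the evaluator flagged as missing.
    return ", ".join(["title"] + feedback)


def stub_evaluator(artifact: str, criteria: List[str]) -> List[str]:
    return [c for c in criteria if c not in artifact]


artifact, unmet, rounds = run_harness(
    "retro game maker", stub_planner, stub_generator, stub_evaluator
)
```

Returning the unmet criteria alongside the artifact makes it explicit whether the loop ended because everything passed or because the round limit was reached.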
Claude Improves Frontend Design Through Skills
A Claude Blog post on improving frontend design through Skills explains how Skills let Claude dynamically load specialized guidance to escape “AI slop” defaults (Inter fonts, purple gradients). A compact frontend aesthetics skill covering typography, themes, motion, and backgrounds dramatically improves the distinctiveness and polish of the output. A web-artifacts-builder skill additionally supports multi-file React + Tailwind + shadcn/ui artifacts that bundle into a single HTML file. Examples show markedly better SaaS landing pages, blogs, dashboards, and interactive apps. https://claude.com/blog/improving-frontend-design-through-skills
Claude Code Major UX Upgrades and Mobile Integration
Claude Code team members shared major improvements. Thariq announced a rewritten renderer that uses a virtual viewport, bringing mouse support, a prompt input that stays pinned to the bottom, and numerous small UX wins (experimental). Cat Wu highlighted session teleporting between the Claude mobile app and the local CLI, so ideas captured on the go can be picked up seamlessly later. Separately, Peter Steinberger advised skipping “plan mode” entirely: just talk to the agent for better results. These updates make Claude Code more fluid for real-time coding and cross-device workflows. https://x.com/trq212/status/2039453692592873587 https://x.com/_catwu/status/2039421527935033854 https://x.com/steipete/status/2039551079621566812
OpenClaw Skills and Task Braindumping Revolution
Builder Zara Zhang shared an aha moment: instead of keeping a to-do list, she now braindumps quick tasks to OpenClaw. The agent records them, actually completes them, and sends a report each morning of what’s done versus what needs attention. She also released the “Follow builders” skill, which remixes 25 top AI accounts and podcasts into a personalized daily newsletter and already has 2k+ stars on GitHub. These tools turn agents into proactive productivity partners. https://x.com/zarazhangrui/status/2039599038358814961 https://x.com/zarazhangrui/status/2039368866741277074