AI in Production
2,213 articles total
Yelp is making its AI chatbot way more useful
Yelp is giving its chatbot assistant a major upgrade, turning the platform into something closer to a digital concierge with a suite of new features designed for "getting things done." The move, one of several AI-focused updates in recent months, is part of a broader industry push to make AI more relevant and practically useful […]
How to Ground a Korean AI Agent in Real Demographics with Synthetic Personas
Adding a new content type to my blog-to-newsletter tool
<p><em><a href="https://simonwillison.net/guides/agentic-engineering-patterns/">Agentic Engineering Patterns</a> ></em></p> <p>Here's an example of a deceptively short prompt that got quite a lot of work done in a single shot.</p> <p>First, some background. I send out a <a href="https://simonw.substack.com/">free Substack newsletter</a> around once a week containing content copied-and-pasted from my blog. I'm effectively using Substack as a lightweight way to allow people to subscribe to my bl
Dairy Queen is putting an AI chatbot in its drive-thrus
Dairy Queen is becoming the latest fast food chain to get in on AI, as it's bringing a chatbot to dozens of its drive-thrus across the US and Canada. It aims to help speed up drive-thru service and "encourage customers to add more food to their orders," according to The Wall Street Journal. Following a […]
Ecom-RLVE: Adaptive Verifiable Environments for E-Commerce Conversational Agents
Inside VAKRA: Reasoning, Tool Use, and Failure Modes of Agents
Cybersecurity Looks Like Proof of Work Now
<p><strong><a href="https://www.dbreunig.com/2026/04/14/cybersecurity-is-proof-of-work-now.html">Cybersecurity Looks Like Proof of Work Now</a></strong></p> The UK's AI Safety Institute recently published <a href="https://www.aisi.gov.uk/blog/our-evaluation-of-claude-mythos-previews-cyber-capabilities">Our evaluation of Claude Mythos Preview’s cyber capabilities</a>, their own independent analysis of <a href="https://simonwillison.net/2026/Apr/7/project-glasswing/">Claude Mythos</a> which backs
Steve Yegge
<p><a href="https://twitter.com/steve_yegge/status/2043747998740689171">Steve Yegge</a>:</p> <blockquote> <p>I was chatting with my buddy at Google, who's been a tech director there for about 20 years, about their AI adoption. Craziest convo I've had all year.</p> <p>The TL;DR is that Google engineering appears to have the same AI adoption footprint as John Deere, the tractor company. Most of the industry has the same internal adoption curve: 20% agentic power users, 20% outright refusers, 60% s
Enterprises power agentic workflows in Cloudflare Agent Cloud with OpenAI
Cloudflare brings OpenAI’s GPT-5.4 and Codex to Agent Cloud, enabling enterprises to build, deploy, and scale AI agents for real-world tasks with speed and security.
Trump officials may be encouraging banks to test Anthropic’s Mythos model
The report is particularly surprising since the Department of Defense recently declared Anthropic a supply-chain risk.
From LLMs to hallucinations, here’s a simple guide to common AI terms
The rise of AI has brought an avalanche of new terms and slang. Here is a glossary with definitions of some of the most important words and phrases you might encounter.
CyberAgent moves faster with ChatGPT Enterprise and Codex
CyberAgent uses ChatGPT Enterprise and Codex to securely scale AI adoption, improve quality, and accelerate decisions across advertising, media, and gaming.
Poke makes using AI agents as easy as sending a text
Poke brings AI agents to everyday users via text message by handling tasks and automations without complex setup, apps, or technical know-how.
ALTK‑Evolve: On‑the‑Job Learning for AI Agents
Introducing the Child Safety Blueprint
Discover OpenAI’s Child Safety Blueprint—a roadmap for building AI responsibly with safeguards, age-appropriate design, and collaboration to protect and empower young people online.
Enabling agent-first process redesign
Unlike static, rules-based systems, AI agents can learn, adapt, and optimize processes dynamically. As they interact with data, systems, people, and other agents in real time, AI agents can execute entire workflows autonomously. But unlocking their potential requires redesigning processes around agents rather than bolting them onto fragmented legacy workflows using traditional optimization methods. Companies…
Announcing the OpenAI Safety Fellowship
A pilot program to support independent safety and alignment research and develop the next generation of talent
Eight years of wanting, three months of building with AI
<p><strong><a href="https://lalitm.com/post/building-syntaqlite-ai/">Eight years of wanting, three months of building with AI</a></strong></p> Lalit Maganti provides one of my favorite pieces of long-form writing on agentic engineering I've seen in ages.</p> <p>They spent eight years thinking about and then three months building <a href="https://github.com/lalitMaganti/syntaqlite">syntaqlite</a>, which they describe as "<a href="https://lalitm.com/post/syntaqlite/">high-fidelity devtools that SQ
The cognitive impact of coding agents
<p>A fun thing about <a href="https://simonwillison.net/2026/Apr/2/lennys-podcast/">recording a podcast</a> with a professional like Lenny Rachitsky is that his team know how to slice the resulting video up into TikTok-sized short form vertical videos. Here's <a href="https://x.com/lennysan/status/2039845666680176703">one he shared on Twitter today</a> which ended up attracting over 1.1m views!</p> <p><video src="https://static.simonwillison.net/static/2026/cognitive-cost.mp4" poster="https://st
Chatbots are now prescribing psychiatric drugs
Utah is allowing an AI system to prescribe psychiatric drugs without a doctor. It's only the second time the state - and the country - has delegated this kind of clinical authority to AI. State officials say it could bring costs down and ease care shortages, but physicians warn the system is opaque, risky, and […]
Highlights from my conversation about agentic engineering on Lenny's Podcast
<p>I was a guest on Lenny Rachitsky's podcast, in a new episode titled <a href="https://www.lennysnewsletter.com/p/an-ai-state-of-the-union">An AI state of the union: We've passed the inflection point, dark factories are coming, and automation timelines</a>. It's available on <a href="https://youtu.be/wc8FBhQtdsA">YouTube</a>, <a href="https://open.spotify.com/episode/0DVjwLT6wgtscdB78Qf1BQ">Spotify</a>, and <a href="https://podcasts.apple.com/us/podcast/an-ai-state-of-the-union-weve-passed-the/
New ways to balance cost and reliability in the Gemini API
<img src="https://storage.googleapis.com/gweb-uniblog-publish-prod/images/cost_reliability_Gemini_API-soc.max-600x600.format-webp.webp">Google is introducing two new inference tiers to the Gemini API, Flex and Priority, to balance cost and latency.
March 2026 sponsors-only newsletter
<p>I just sent the March edition of my <a href="https://github.com/sponsors/simonw/">sponsors-only monthly newsletter</a>. If you are a sponsor (or if you start a sponsorship now) you can <a href="https://github.com/simonw-private/monthly/blob/main/2026-03-march.md">access it here</a>. In this month's newsletter:</p> <ul> <li>More agentic engineering patterns</li> <li>Streaming experts with MoE models on a Mac</li> <li>Model releases in March</li> <li>Vibe porting</li> <li>Supply chain attacks a
Gradient Labs gives every bank customer an AI account manager
Gradient Labs uses GPT-4.1 and GPT-5.4 mini and nano to power AI agents that automate banking support workflows with low latency and high reliability.
Stanford study outlines dangers of asking AI chatbots for personal advice
While there’s been plenty of debate about AI sycophancy, a new study by Stanford computer scientists attempts to measure how harmful that tendency might be.
Quoting Matt Webb
<blockquote cite="https://interconnected.org/home/2026/03/28/architecture"><p>The thing about agentic coding is that agents grind problems into dust. Give an agent a problem and a while loop and - long term - it’ll solve that problem even if it means burning a trillion tokens and re-writing down to the silicon. [...]</p> <p>But we want AI agents to solve coding problems quickly and in a way that is maintainable and adaptive and composable (benefiting from improvements elsewhere), and where every
You can now transfer your chats and personal information from other chatbots directly into Gemini
Google is launching "switching tools" that, just as it sounds, will make it easier for users of other chatbots to switch to Gemini.
Apple will reportedly allow other AI chatbots to plug into Siri
Apple's iOS 27 update will allow users to choose the AI chatbot they want to link with Siri. That's according to a report from Bloomberg's Mark Gurman, who says third-party chatbots downloaded from the App Store, like Google's Gemini or Anthropic's Claude, will be able to fetch replies for Siri - similar to how the […]
Thoughts on slowing the fuck down
<p><strong><a href="https://news.ycombinator.com/item?id=47517539">Thoughts on slowing the fuck down</a></strong></p> Mario Zechner created the <a href="https://github.com/badlogic/pi-mono">Pi agent framework</a> used by OpenClaw, giving considerable credibility to his opinions on current trends in agentic engineering. He's not impressed:</p> <blockquote> <p>We have basically given up all discipline and agency for a sort of addiction, where your highest goal is to produce the largest amount of c
Agentic commerce runs on truth and context
Imagine telling a digital agent, “Use my points and book a family trip to Italy. Keep it within budget, pick hotels we’ve liked before, and handle the details.” Instead of returning a list of links, the agent assembles an itinerary and executes the purchase. That shift, from assistance to execution, is what makes agentic AI…
Introducing the OpenAI Safety Bug Bounty program
OpenAI launches a Safety Bug Bounty program to identify AI abuse and safety risks, including agentic vulnerabilities, prompt injection, and data exfiltration.
Arm’s first CPU ever will plug into Meta’s AI data centers later this year
After decades of only licensing its chip designs for others to use, UK-based Arm revealed the first chip it's producing on its own, and the first customer. Dubbed the Arm AGI CPU, it's another chip designed for inference, or running the cloud processing for AI tools like AI agents that can continue to spawn more […]
Exclusive eBook: Are we ready to hand AI agents the keys?
We’re starting to give AI agents real autonomy, but are we prepared for what could happen next? This subscriber-only eBook explores that question through expert perspectives, such as “If we continue on the current path … we are basically playing Russian roulette with humanity.” By Grace Huckins, June 12, 2025.
Mirage raises $75M to continue building models for its AI video-editing app Captions
Mirage, the maker of video-editing app Captions, has raised $75 million in growth financing from General Catalyst's Customer Value Fund (CVF).
Powering product discovery in ChatGPT
ChatGPT introduces richer, visually immersive shopping powered by the Agentic Commerce Protocol, enabling product discovery, side-by-side comparisons, and merchant integration.
A New Framework for Evaluating Voice Agents (EVA)
Using Git with coding agents
<p><em><a href="https://simonwillison.net/guides/agentic-engineering-patterns/">Agentic Engineering Patterns</a> ></em></p> <p>Git is a key tool for working with coding agents. Keeping code in version control lets us record how that code changes over time and investigate and reverse any mistakes. All of the coding agents are fluent in using Git's features, both basic and advanced.</p> <p>This fluency means we can be more ambitious about how we use Git ourselves. We don't need to memorize <em>how
WordPress.com now lets AI agents write and publish posts, and more
New AI agents on WordPress.com could lower barriers to publishing while increasing machine-generated content across the web.
How we monitor internal coding agents for misalignment
March 19, 2026 | Safety | Publication
Using our most powerful models to detect and study misaligned behavior in real-world deployments.
AI systems are beginning to act with greater autonomy in real-world environments at scale. As their capabilities advance, they are able to take on increasingly complex, high-impact tasks and interact with tools, systems, and workflows in ways that resemble human collaborators.
A core part of OpenAI’s mission is helping the world navigate this transition to AGI responsibly. That means not only building highly capable systems, but also developing the methods, infrastructure, and approaches needed to deploy and manage them safely as their capabilities continue to grow.
Monitoring internally deployed agents is one of the key ways we’re doing this, and it allows us both to learn from real-world usage and to identify and mitigate emerging risks. Over the last few months, we’ve built and refined a monitoring system for coding agents we use internally as one part of our broader safety approach. This post describes how the system works, what we’ve learned so far, and how we see this approach evolving as agentic capabilities continue to advance.
Using advanced coding agents internally gives us an opportunity to understand how they behave in realistic workflows, and how users apply these agents to complex, tool-rich tasks across a wide range of environments. Internal coding agent deployments at OpenAI also come with unique risk factors for misalignment: they have access to select internal OpenAI systems, can inspect the documentation and code of their safeguards, or even attempt to modify those safeguards, and could take actions in ways that could affect their future versions. These properties make internal deployments a uniquely important setting to innovate on monitoring infrastructure, and we believe similar safeguards should be standard for internal coding agent deployments across the industry.
Our approach & how it works
As capabilities advance, we believe monitoring agentic behavior, including both the model’s actions and their internal reasoning, will become an increasingly important safety tool. Monitoring internal coding agents helps us understand both how often misaligned behavior occurs and what it looks like in practice. The goal is to flag misalignment-relevant behaviors that can be hard to surface before the model is deployed (though we’re continuing our efforts in this area) because they often appear only in realistic, tool-rich workflows and extended sessions. It is also critical to preserve user privacy and data security from coding agent sessions, and only to surface problematic behaviors in narrowly-scoped ways.
To support this, we’ve developed a low-latency internal monitoring system, powered by GPT‑5.4 Thinking at maximum reasoning effort, that reviews the agent’s interactions and alerts us to actions that may be inconsistent with a user’s intent, or that may violate our own internal security or compliance policies. The monitor logs and analyzes the agents’ chains of thought and actions, and automatically alerts for suspicious or problematic behavior so we can triage quickly, mitigate consequences, and improve our other safeguards. This feedback loop allows us to rapidly detect and resolve potential security issues, improve the alignment of our models, and is a core component of our long-term agent security strategy.
In its latest form, our monitoring system reviews interactions within 30 minutes of completion, categorizing the interaction and assigning it a severity level. As the latency decreases towards near real-time review, the security benefits increase significantly. Eventually, the monitor may be able to help evaluate coding agent actions before they are taken—providing another important defense-in-depth control alongside other existing security monitors. Initial results appear strong, with the monitoring system consistently outperforming employee escalations—flagging every interaction that employees independently reported through internal channels. The monitor has also surfaced many additional behaviors that might have otherwise gone unnoticed by internal users. Through this work, we’ve found that our models can be overly eager to work around restrictions in pursuit of a user-specified goal, especially when the user request inadvertently encourages it. That said, so far we have not seen evidence of motivations beyond the original task (for example, self-preservation or scheming). A very small fraction of traffic (<0.1%) primarily from bespoke or local setups is currently outside our monitoring coverage, and we are working to close that gap.
What we monitor for
The monitoring system privately runs across all standard internal coding agent deployments, viewing the full conversation history, including all chains of thought, all user, developer, and assistant messages, along with tool calls an
Meta is having trouble with rogue AI agents
A rogue AI agent inadvertently exposed Meta company and user data to engineers who didn't have permission to see it.
Nothing CEO Carl Pei says smartphone apps will disappear as AI agents take their place
Nothing CEO Carl Pei says AI agents will eventually replace apps, shifting smartphones toward systems that understand intent and act on a user's behalf.
Holotron-12B - High Throughput Computer Use Agent
Subagents
<p><em><a href="https://simonwillison.net/guides/agentic-engineering-patterns/">Agentic Engineering Patterns</a> ></em></p> <p>LLMs are restricted by their <strong>context limit</strong> - how many tokens they can fit in their working memory at any given time. These values have not increased much over the past two years even as the LLMs themselves have seen dramatic improvements in their abilities - they generally top out at around 1,000,000, and benchmarks frequently report better quality resul
OpenAI Japan announces Japan Teen Safety Blueprint to put teen safety first
OpenAI Japan announces the Japan Teen Safety Blueprint, introducing stronger age protections, parental controls, and well-being safeguards for teens using generative AI.
Use subagents and custom agents in Codex
<p><strong><a href="https://developers.openai.com/codex/subagents">Use subagents and custom agents in Codex</a></strong></p> Subagents were announced in general availability today for OpenAI Codex, after several weeks of preview behind a feature flag.</p> <p>They're very similar to the Claude Code implementation, with default subagents for "explorer", "worker" and "default". It's unclear to me what the difference between "worker" and "default" is but based on their CSV example I think "worker" i
Quoting A member of Anthropic’s alignment-science team
<blockquote cite="https://www.newyorker.com/news/annals-of-inquiry/the-pentagon-went-to-war-with-anthropic-whats-really-at-stake?_sp=9a6e0ff7-2bfd-46f8-a9e1-3941ef2003b5.1773495048769"><p>The point of <a href="https://simonwillison.net/2025/Jun/20/agentic-misalignment/">the blackmail exercise</a> was to have something to describe to policymakers—results that are visceral enough to land with people, and make misalignment risk actually salient in practice for people who had never thought about it
Coding agents for data analysis
<p><strong><a href="https://simonw.github.io/nicar-2026-coding-agents/">Coding agents for data analysis</a></strong></p> Here's the handout I prepared for my NICAR 2026 workshop "Coding agents for data analysis" - a three hour session aimed at data journalists demonstrating ways that tools like Claude Code and OpenAI Codex can be used to explore, analyze and clean data.</p> <p>Here's the table of contents:</p> <blockquote> <ul> <li><a href="https://simonw.github.io/nicar-2026-coding-agents/codin
How coding agents work
<p><em><a href="https://simonwillison.net/guides/agentic-engineering-patterns/">Agentic Engineering Patterns</a> ></em></p> <p>As with any tool, understanding how <a href="https://simonwillison.net/guides/agentic-engineering-patterns/what-is-agentic-engineering/">coding agents</a> work under the hood can help you make better decisions about how to apply them.</p> <p>A coding agent is a piece of software that acts as a <strong>harness</strong> for an LLM, extending that LLM with additional capabi
What is agentic engineering?
<p><em><a href="https://simonwillison.net/guides/agentic-engineering-patterns/">Agentic Engineering Patterns</a> ></em></p> <p>I use the term <strong>agentic engineering</strong> to describe the practice of developing software with the assistance of coding agents.</p> <p>What are <strong>coding agents</strong>? They're agents that can both write and execute code. Popular examples include <a href="https://code.claude.com/">Claude Code</a>, <a href="https://openai.com/codex/">OpenAI Codex</a>, and
My fireside chat about agentic engineering at the Pragmatic Summit
<p>I was a speaker last month at the <a href="https://www.pragmaticsummit.com/">Pragmatic Summit</a> in San Francisco, where I participated in a fireside chat session about <a href="https://simonwillison.net/guides/agentic-engineering-patterns/">Agentic Engineering</a> hosted by Eric Lui from Statsig.</p> <p>The video is <a href="https://www.youtube.com/watch?v=owmJyKVu5f8">available on YouTube</a>. Here are my highlights from the conversation.</p> <iframe style="margin-top: 1.5em; margin-bottom
Designing AI agents to resist prompt injection
How ChatGPT defends against prompt injection and social engineering by constraining risky actions and protecting sensitive data in agent workflows.
AI should help us produce better code
<p><em><a href="https://simonwillison.net/guides/agentic-engineering-patterns/">Agentic Engineering Patterns</a> ></em></p> <p>Many developers worry that outsourcing their code to AI tools will result in a drop in quality, producing bad code that's churned out fast enough that decision makers are willing to overlook its flaws.</p> <p>If adopting coding agents demonstrably reduces the quality of the code and features you are producing, you should address that problem directly: figure out which as
Production query plans without production data
<p><strong><a href="https://boringsql.com/posts/portable-stats/">Production query plans without production data</a></strong></p> Radim Marek describes the new <a href="https://www.postgresql.org/docs/current/functions-admin.html#FUNCTIONS-ADMIN-STATSMOD"><code>pg_restore_relation_stats()</code> and <code>pg_restore_attribute_stats()</code> functions</a> that were introduced <a href="https://www.postgresql.org/docs/current/release-18.html">in PostgreSQL 18</a> in September 2025.</p> <p>The Postgr
LeRobot v0.5.0: Scaling Every Dimension
How Balyasny Asset Management built an AI research engine for investing
See how Balyasny built an AI research system with GPT-5.4, rigorous model evaluation, and agent workflows to transform investment analysis at scale.
Introducing the Stateful Runtime Environment for Agents in Amazon Bedrock
Stateful Runtime for Agents in Amazon Bedrock brings persistent orchestration, memory, and secure execution to multi-step AI workflows powered by OpenAI.
Nano Banana 2: Combining Pro capabilities with lightning-fast speed
<img src="https://storage.googleapis.com/gweb-uniblog-publish-prod/images/NB2_Hero.max-600x600.format-webp.webp">Our latest image generation model offers advanced world knowledge, production-ready specs, subject consistency and more, all at Flash speed.
Pacific Northwest National Laboratory and OpenAI partner to accelerate federal permitting
OpenAI and Pacific Northwest National Laboratory introduce DraftNEPABench, a new benchmark evaluating how AI coding agents can accelerate federal permitting—showing potential to reduce NEPA drafting time by up to 15% and modernize infrastructure reviews.
Why we no longer evaluate SWE-bench Verified
SWE-bench Verified is increasingly contaminated and mismeasures frontier coding progress. Our analysis shows flawed tests and training leakage. We recommend SWE-bench Pro.
OpenAI announces Frontier Alliance Partners
OpenAI announces Frontier Alliance Partners to help enterprises move from AI pilots to production with secure, scalable agent deployments.
Advancing independent research on AI alignment
OpenAI commits $7.5M to The Alignment Project to fund independent AI alignment research, strengthening global efforts to address AGI safety and security risks.
IBM and UC Berkeley Diagnose Why Enterprise Agents Fail Using IT-Bench and MAST
Introducing EVMbench
OpenAI and Paradigm introduce EVMbench, a benchmark evaluating AI agents’ ability to detect, patch, and exploit high-severity smart contract vulnerabilities.
Beyond rate limits: scaling access to Codex and Sora
How OpenAI built a real-time access system combining rate limits, usage tracking, and credits to power continuous access to Sora and Codex.
Scaling social science research
GABRIEL is a new open-source toolkit from OpenAI that uses GPT to turn qualitative text and images into quantitative data, helping social scientists analyze research at scale.
OpenEnv in Practice: Evaluating Tool-Using Agents in Real-World Environments
Harness engineering: leveraging Codex in an agent-first world
By Ryan Lopopolo, Member of the Technical Staff
Bringing ChatGPT to GenAI.mil
OpenAI for Government announces the deployment of a custom ChatGPT on GenAI.mil, bringing secure, safety-forward AI to U.S. defense teams.
Unlocking the Codex harness: how we built the App Server
Learn how to embed the Codex agent using the Codex App Server, a bidirectional JSON-RPC API powering streaming progress, tool use, approvals, and diffs.
Inside OpenAI’s in-house data agent
How OpenAI built an in-house AI data agent that uses GPT-5, Codex, and memory to reason over massive datasets and deliver reliable insights in minutes.
Keeping your data safe when an AI agent clicks a link
Learn how OpenAI protects user data when AI agents open links, preventing URL-based data exfiltration and prompt injection with built-in safeguards.
Alyah ⭐️: Toward Robust Evaluation of Emirati Dialect Capabilities in Arabic LLMs
Unlocking Agentic RL Training for GPT-OSS: A Practical Retrospective
Powering tax donations with AI powered personalized recommendations
TRUSTBANK partnered with Recursive to build Choice AI using OpenAI models, delivering personalized, conversational recommendations that simplify Furusato Nozei gift discovery. A multi-agent system helps donors navigate thousands of options and find gifts that match their preferences.
Unrolling the Codex agent loop
A technical deep dive into the Codex agent loop, explaining how Codex CLI orchestrates models, tools, prompts, and performance using the Responses API.
Scaling PostgreSQL to power 800 million ChatGPT users
An inside look at how OpenAI scaled PostgreSQL to millions of queries per second using replicas, caching, rate limiting, and workload isolation.
AssetOpsBench: Bridging the Gap Between AI Agent Benchmarks and Industrial Reality
Netomi’s lessons for scaling agentic systems into the enterprise
How Netomi scales enterprise AI agents using GPT-4.1 and GPT-5.2—combining concurrency, governance, and multi-step reasoning for reliable production workflows.
NVIDIA brings agents to life with DGX Spark and Reachy Mini
AprielGuard: A Guardrail for Safety and Adversarial Robustness in Modern LLM Systems
Continuously hardening ChatGPT Atlas against prompt injection
OpenAI is strengthening ChatGPT Atlas against prompt injection attacks using automated red teaming trained with reinforcement learning. This proactive discover-and-patch loop helps identify novel exploits early and harden the browser agent’s defenses as AI becomes more agentic.
Evaluating chain-of-thought monitorability
OpenAI introduces a new framework and evaluation suite for chain-of-thought monitorability, covering 13 evaluations across 24 environments. Our findings show that monitoring a model’s internal reasoning is far more effective than monitoring outputs alone, offering a promising path toward scalable control as AI systems grow more capable.
Updating our Model Spec with teen protections
OpenAI is updating its Model Spec with new Under-18 Principles that define how ChatGPT should support teens with safe, age-appropriate guidance grounded in developmental science. The update strengthens guardrails, clarifies expected model behavior in higher-risk situations, and builds on our broader work to improve teen safety across ChatGPT.
Addendum to GPT-5.2 System Card: GPT-5.2-Codex
This system card outlines the comprehensive safety measures implemented for GPT‑5.2-Codex. It details both model-level mitigations, such as specialized safety training for harmful tasks and prompt injections, and product-level mitigations like agent sandboxing and configurable network access.
The Open Evaluation Standard: Benchmarking NVIDIA Nemotron 3 Nano with NeMo Evaluator
Evaluating AI’s ability to perform scientific research tasks
OpenAI introduces FrontierScience, a benchmark testing AI reasoning in physics, chemistry, and biology to measure progress toward real scientific research.
OpenAI co-founds Agentic AI Foundation, donates AGENTS.md
OpenAI co-founds the Agentic AI Foundation under the Linux Foundation and donates AGENTS.md to support open, interoperable standards for safe agentic AI.
DeepMath: A lightweight math reasoning Agent with smolagents
Inside Mirakl's agentic commerce vision
Mirakl is redefining commerce through AI agents and ChatGPT Enterprise—achieving faster documentation, smarter customer support, and building toward agent-native commerce with Mirakl Nexus.
Strengthening our safety ecosystem with external testing
OpenAI works with independent experts to evaluate frontier AI systems. Third-party testing strengthens safety, validates safeguards, and increases transparency in how we assess model capabilities and risks.
How evals drive the next chapter in AI for businesses
Learn how evals help businesses define, measure, and improve AI performance—reducing risk, boosting productivity, and driving strategic advantage.
GPT-5.1-Codex-Max System Card
This system card outlines the comprehensive safety measures implemented for GPT‑5.1-Codex-Max. It details both model-level mitigations, such as specialized safety training for harmful tasks and prompt injections, and product-level mitigations like agent sandboxing and configurable network access.
How Scania is accelerating work with AI across its global workforce
Global manufacturer Scania is scaling AI with ChatGPT Enterprise. With team-based onboarding and strong guardrails, AI is boosting productivity, quality, and innovation.
How Philips is scaling AI literacy across 70,000 employees
Philips is scaling AI literacy with ChatGPT Enterprise, training 70,000 employees to use AI responsibly and improve healthcare outcomes worldwide.
GPT-5.1 Instant and GPT-5.1 Thinking System Card Addendum
This GPT-5 system card addendum provides updated safety metrics for GPT-5.1 Instant and Thinking, including new evaluations for mental health and emotional reliance.
Notion’s rebuild for agentic AI: How GPT‑5 helped unlock autonomous workflows
Discover how Notion rebuilt its AI architecture with GPT-5 to create autonomous agents that reason, act, and adapt across workflows. Learn how this shift unlocked smarter, faster, and more flexible productivity in Notion 3.0.
From Pilot to Practice: How BBVA Is Scaling AI Across the Organization
BBVA is reimagining how employees work with ChatGPT Enterprise, embedding AI into everyday operations. The bank has saved hours per week per employee, created 20,000+ Custom GPTs, and achieved up to 80% efficiency gains.
Introducing the Teen Safety Blueprint
Discover OpenAI’s Teen Safety Blueprint—a roadmap for building AI responsibly with safeguards, age-appropriate design, and collaboration to protect and empower young people online.
Introducing Aardvark: OpenAI’s agentic security researcher
OpenAI introduces Aardvark, an AI-powered security researcher that autonomously finds, validates, and helps fix software vulnerabilities at scale. The system is in private beta—sign up to join early testing.