AI in Production
1,752 articles total
WordPress.com now lets AI agents write and publish posts, and more
New AI agents on WordPress.com could lower barriers to publishing while increasing machine-generated content across the web.
How we monitor internal coding agents for misalignment
How OpenAI uses chain-of-thought monitoring to study misalignment in internal coding agents—analyzing real-world deployments to detect risks and strengthen AI safety safeguards.
Meta is having trouble with rogue AI agents
A rogue AI agent inadvertently exposed Meta company and user data to engineers who didn't have permission to see it.
Nothing CEO Carl Pei says smartphone apps will disappear as AI agents take their place
Nothing CEO Carl Pei says AI agents will eventually replace apps, shifting smartphones toward systems that understand intent and act on a user's behalf.
Holotron-12B - High Throughput Computer Use Agent
Subagents
<p><em><a href="https://simonwillison.net/guides/agentic-engineering-patterns/">Agentic Engineering Patterns</a> ></em></p> <p>LLMs are restricted by their <strong>context limit</strong> - how many tokens they can fit in their working memory at any given time. These values have not increased much over the past two years even as the LLMs themselves have seen dramatic improvements in their abilities - they generally top out at around 1,000,000, and benchmarks frequently report better quality results […]</p>
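The pattern this excerpt motivates can be sketched in a few lines: a parent agent delegates a task to a subagent that runs in a fresh, isolated context, and only a compact summary travels back. This is a toy illustration under stated assumptions - `fake_model`, `run_subagent`, and the message shapes are invented for the example, not any real agent API.

```python
# Toy illustration of the subagent pattern: the parent agent keeps only a
# short summary in its own context instead of the subagent's full transcript.
# `fake_model` stands in for a real LLM call and is purely hypothetical.

def fake_model(messages):
    # Pretend the model answers the last user message, verbosely.
    task = messages[-1]["content"]
    return f"Detailed analysis of: {task}. " + "filler " * 50

def run_subagent(task, model):
    """Run a task in a fresh, isolated context; return a compact summary."""
    transcript = [{"role": "user", "content": task}]
    full_answer = model(transcript)
    # Only the first sentence travels back to the parent agent; the long
    # transcript is discarded along with the subagent's context window.
    return full_answer.split(".")[0] + "."

parent_context = [{"role": "user", "content": "Audit the billing module"}]
summary = run_subagent("summarize recent errors", fake_model)
parent_context.append({"role": "assistant", "content": summary})
print(summary)  # Detailed analysis of: summarize recent errors.
```

The point of the design is visible in the last two lines: the parent's context grows by one short message regardless of how long the subagent's transcript was.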
OpenAI Japan announces Japan Teen Safety Blueprint to put teen safety first
OpenAI Japan announces the Japan Teen Safety Blueprint, introducing stronger age protections, parental controls, and well-being safeguards for teens using generative AI.
Use subagents and custom agents in Codex
<p><strong><a href="https://developers.openai.com/codex/subagents">Use subagents and custom agents in Codex</a></strong></p> <p>Subagents were announced in general availability today for OpenAI Codex, after several weeks of preview behind a feature flag.</p> <p>They're very similar to the Claude Code implementation, with default subagents for "explorer", "worker" and "default". It's unclear to me what the difference between "worker" and "default" is but based on their CSV example I think "worker" is […]</p>
Quoting a member of Anthropic’s alignment-science team
<blockquote cite="https://www.newyorker.com/news/annals-of-inquiry/the-pentagon-went-to-war-with-anthropic-whats-really-at-stake?_sp=9a6e0ff7-2bfd-46f8-a9e1-3941ef2003b5.1773495048769"><p>The point of <a href="https://simonwillison.net/2025/Jun/20/agentic-misalignment/">the blackmail exercise</a> was to have something to describe to policymakers—results that are visceral enough to land with people, and make misalignment risk actually salient in practice for people who had never thought about it […]</p></blockquote>
Coding agents for data analysis
<p><strong><a href="https://simonw.github.io/nicar-2026-coding-agents/">Coding agents for data analysis</a></strong></p> <p>Here's the handout I prepared for my NICAR 2026 workshop "Coding agents for data analysis" - a three hour session aimed at data journalists demonstrating ways that tools like Claude Code and OpenAI Codex can be used to explore, analyze and clean data.</p> <p>Here's the table of contents: […]</p>
How coding agents work
<p><em><a href="https://simonwillison.net/guides/agentic-engineering-patterns/">Agentic Engineering Patterns</a> ></em></p> <p>As with any tool, understanding how <a href="https://simonwillison.net/guides/agentic-engineering-patterns/what-is-agentic-engineering/">coding agents</a> work under the hood can help you make better decisions about how to apply them.</p> <p>A coding agent is a piece of software that acts as a <strong>harness</strong> for an LLM, extending that LLM with additional capabilities […]</p>
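The harness idea this excerpt describes can be sketched as a minimal loop, assuming a scripted stand-in for the model and a toy tool table; none of the names below come from any real coding agent, and a real harness would add streaming, sandboxing, and approvals.

```python
# Minimal sketch of a coding-agent harness loop: the harness feeds the
# model's tool requests to executors and loops until the model stops asking.
# The scripted `stub_model` and the tool names are illustrative only.

def stub_model(history):
    # First turn: request a shell command; second turn: final answer.
    if not any(m["role"] == "tool" for m in history):
        return {"tool": "shell", "args": "echo 2+2 | bc"}
    return {"tool": None, "answer": "The command printed 4."}

TOOLS = {"shell": lambda args: "4\n"}  # pretend executor, no real shell

def agent_loop(task, model, tools):
    history = [{"role": "user", "content": task}]
    while True:
        reply = model(history)
        if reply["tool"] is None:       # model is done: return its answer
            return reply["answer"]
        result = tools[reply["tool"]](reply["args"])
        history.append({"role": "tool", "content": result})

print(agent_loop("What is 2+2?", stub_model, TOOLS))
```

The harness itself is the small `agent_loop`: everything interesting - tool execution, feeding results back, deciding when to stop - lives outside the model call.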
What is agentic engineering?
<p><em><a href="https://simonwillison.net/guides/agentic-engineering-patterns/">Agentic Engineering Patterns</a> ></em></p> <p>I use the term <strong>agentic engineering</strong> to describe the practice of developing software with the assistance of coding agents.</p> <p>What are <strong>coding agents</strong>? They're agents that can both write and execute code. Popular examples include <a href="https://code.claude.com/">Claude Code</a>, <a href="https://openai.com/codex/">OpenAI Codex</a>, and […]</p>
My fireside chat about agentic engineering at the Pragmatic Summit
<p>I was a speaker last month at the <a href="https://www.pragmaticsummit.com/">Pragmatic Summit</a> in San Francisco, where I participated in a fireside chat session about <a href="https://simonwillison.net/guides/agentic-engineering-patterns/">Agentic Engineering</a> hosted by Eric Lui from Statsig.</p> <p>The video is <a href="https://www.youtube.com/watch?v=owmJyKVu5f8">available on YouTube</a>. Here are my highlights from the conversation.</p>
Designing AI agents to resist prompt injection
How ChatGPT defends against prompt injection and social engineering by constraining risky actions and protecting sensitive data in agent workflows.
AI should help us produce better code
<p><em><a href="https://simonwillison.net/guides/agentic-engineering-patterns/">Agentic Engineering Patterns</a> ></em></p> <p>Many developers worry that outsourcing their code to AI tools will result in a drop in quality, producing bad code that's churned out fast enough that decision makers are willing to overlook its flaws.</p> <p>If adopting coding agents demonstrably reduces the quality of the code and features you are producing, you should address that problem directly: figure out which […]</p>
Production query plans without production data
<p><strong><a href="https://boringsql.com/posts/portable-stats/">Production query plans without production data</a></strong></p> <p>Radim Marek describes the new <a href="https://www.postgresql.org/docs/current/functions-admin.html#FUNCTIONS-ADMIN-STATSMOD"><code>pg_restore_relation_stats()</code> and <code>pg_restore_attribute_stats()</code> functions</a> that were introduced <a href="https://www.postgresql.org/docs/current/release-18.html">in PostgreSQL 18</a> in September 2025.</p> <p>The PostgreSQL […]</p>
LeRobot v0.5.0: Scaling Every Dimension
How Balyasny Asset Management built an AI research engine for investing
See how Balyasny built an AI research system with GPT-5.4, rigorous model evaluation, and agent workflows to transform investment analysis at scale.
Introducing the Stateful Runtime Environment for Agents in Amazon Bedrock
Stateful Runtime for Agents in Amazon Bedrock brings persistent orchestration, memory, and secure execution to multi-step AI workflows powered by OpenAI.
Nano Banana 2: Combining Pro capabilities with lightning-fast speed
Our latest image generation model offers advanced world knowledge, production-ready specs, subject consistency and more, all at Flash speed.
Pacific Northwest National Laboratory and OpenAI partner to accelerate federal permitting
OpenAI and Pacific Northwest National Laboratory introduce DraftNEPABench, a new benchmark evaluating how AI coding agents can accelerate federal permitting—showing potential to reduce NEPA drafting time by up to 15% and modernize infrastructure reviews.
Why we no longer evaluate SWE-bench Verified
SWE-bench Verified is increasingly contaminated and mismeasures frontier coding progress. Our analysis shows flawed tests and training leakage. We recommend SWE-bench Pro.
OpenAI announces Frontier Alliance Partners
OpenAI announces Frontier Alliance Partners to help enterprises move from AI pilots to production with secure, scalable agent deployments.
Advancing independent research on AI alignment
OpenAI commits $7.5M to The Alignment Project to fund independent AI alignment research, strengthening global efforts to address AGI safety and security risks.
IBM and UC Berkeley Diagnose Why Enterprise Agents Fail Using IT-Bench and MAST
Introducing EVMbench
OpenAI and Paradigm introduce EVMbench, a benchmark evaluating AI agents’ ability to detect, patch, and exploit high-severity smart contract vulnerabilities.
Beyond rate limits: scaling access to Codex and Sora
How OpenAI built a real-time access system combining rate limits, usage tracking, and credits to power continuous access to Sora and Codex.
Scaling social science research
GABRIEL is a new open-source toolkit from OpenAI that uses GPT to turn qualitative text and images into quantitative data, helping social scientists analyze research at scale.
OpenEnv in Practice: Evaluating Tool-Using Agents in Real-World Environments
Harness engineering: leveraging Codex in an agent-first world
By Ryan Lopopolo, Member of the Technical Staff
Bringing ChatGPT to GenAI.mil
OpenAI for Government announces the deployment of a custom ChatGPT on GenAI.mil, bringing secure, safety-forward AI to U.S. defense teams.
Unlocking the Codex harness: how we built the App Server
Learn how to embed the Codex agent using the Codex App Server, a bidirectional JSON-RPC API powering streaming progress, tool use, approvals, and diffs.
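For readers unfamiliar with the wire format this blurb mentions, the sketch below shows generic JSON-RPC 2.0 framing. The `turn/start` method name and its parameters are invented for illustration; consult the Codex App Server documentation for the real method names and message flow.

```python
import json

# Generic JSON-RPC 2.0 request/response framing, the wire format a
# bidirectional API of this kind uses. Method and parameter names here
# are assumptions, not the actual Codex App Server protocol.

def make_request(req_id, method, params):
    return json.dumps({"jsonrpc": "2.0", "id": req_id,
                       "method": method, "params": params})

def parse_response(raw):
    msg = json.loads(raw)
    if "error" in msg:                      # spec: error XOR result
        raise RuntimeError(msg["error"]["message"])
    return msg["result"]

req = make_request(1, "turn/start", {"prompt": "fix the failing test"})
raw_response = json.dumps({"jsonrpc": "2.0", "id": 1,
                           "result": {"status": "accepted"}})
print(parse_response(raw_response)["status"])  # accepted
```

In a bidirectional setup, both sides can originate requests over the same connection (server-to-client requests are how streaming progress and approval prompts typically arrive), matched up by `id`.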
Inside OpenAI’s in-house data agent
How OpenAI built an in-house AI data agent that uses GPT-5, Codex, and memory to reason over massive datasets and deliver reliable insights in minutes.
Keeping your data safe when an AI agent clicks a link
Learn how OpenAI protects user data when AI agents open links, preventing URL-based data exfiltration and prompt injection with built-in safeguards.
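One common mitigation in this space is host allowlisting combined with query stripping, since injected prompts often smuggle secrets out through URL parameters. The sketch below is a generic illustration using only the Python standard library; the allowlist contents are assumptions, and it is not OpenAI's actual safeguard.

```python
from urllib.parse import urlparse, urlunparse

# Generic link-safety check: only allow fetches to known hosts, and drop
# the query string and fragment where exfiltrated data often rides along.
# ALLOWED_HOSTS is illustrative, not any product's real policy.

ALLOWED_HOSTS = {"docs.python.org", "example.com"}

def sanitize_link(url):
    parts = urlparse(url)
    if parts.scheme not in ("http", "https"):
        raise ValueError(f"blocked scheme: {parts.scheme!r}")
    if parts.hostname not in ALLOWED_HOSTS:
        raise ValueError(f"blocked host: {parts.hostname!r}")
    # Rebuild the URL without params, query, or fragment.
    return urlunparse((parts.scheme, parts.netloc, parts.path, "", "", ""))

print(sanitize_link("https://example.com/page?secret=api_key_123"))
# https://example.com/page
```

Real systems layer more on top (per-session allowlists, checking whether the URL ever appeared in trusted content), but the shape of the check is the same: decide before the agent's click, not after.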
Alyah ⭐️: Toward Robust Evaluation of Emirati Dialect Capabilities in Arabic LLMs
Unlocking Agentic RL Training for GPT-OSS: A Practical Retrospective
Powering tax donations with AI powered personalized recommendations
TRUSTBANK partnered with Recursive to build Choice AI using OpenAI models, delivering personalized, conversational recommendations that simplify Furusato Nozei gift discovery. A multi-agent system helps donors navigate thousands of options and find gifts that match their preferences.
Unrolling the Codex agent loop
A technical deep dive into the Codex agent loop, explaining how Codex CLI orchestrates models, tools, prompts, and performance using the Responses API.
Scaling PostgreSQL to power 800 million ChatGPT users
An inside look at how OpenAI scaled PostgreSQL to millions of queries per second using replicas, caching, rate limiting, and workload isolation.
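Two of the techniques listed above can be sketched generically: a read-through cache that spares the primary database, and a token-bucket rate limiter that sheds excess load. This is a toy under stated assumptions (refill is disabled so the example is deterministic), not OpenAI's implementation or settings.

```python
import time

# A read-through cache plus a token bucket, the two load-shedding ideas
# from the article in miniature. Parameters are illustrative only.

class TokenBucket:
    def __init__(self, rate, capacity):
        self.rate, self.capacity = rate, capacity
        self.tokens, self.last = capacity, time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

cache = {}
def read_through(key, query_db):
    if key not in cache:            # miss: hit the database once
        cache[key] = query_db(key)
    return cache[key]               # hit: the primary is never touched

calls = []
def fake_db(key):
    calls.append(key)
    return f"row-for-{key}"

# rate=0 disables refill so exactly `capacity` requests are admitted.
bucket = TokenBucket(rate=0, capacity=2)
results = [read_through("user:42", fake_db) for _ in range(3) if bucket.allow()]
print(results, len(calls))  # two admitted reads, one database query
```

The combination matters: the cache absorbs repeated reads so they never reach PostgreSQL, and the bucket bounds what happens when the cache cannot help.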
AssetOpsBench: Bridging the Gap Between AI Agent Benchmarks and Industrial Reality
Netomi’s lessons for scaling agentic systems into the enterprise
How Netomi scales enterprise AI agents using GPT-4.1 and GPT-5.2—combining concurrency, governance, and multi-step reasoning for reliable production workflows.
NVIDIA brings agents to life with DGX Spark and Reachy Mini
AprielGuard: A Guardrail for Safety and Adversarial Robustness in Modern LLM Systems
Continuously hardening ChatGPT Atlas against prompt injection
OpenAI is strengthening ChatGPT Atlas against prompt injection attacks using automated red teaming trained with reinforcement learning. This proactive discover-and-patch loop helps identify novel exploits early and harden the browser agent’s defenses as AI becomes more agentic.
Evaluating chain-of-thought monitorability
OpenAI introduces a new framework and evaluation suite for chain-of-thought monitorability, covering 13 evaluations across 24 environments. Our findings show that monitoring a model’s internal reasoning is far more effective than monitoring outputs alone, offering a promising path toward scalable control as AI systems grow more capable.
Updating our Model Spec with teen protections
OpenAI is updating its Model Spec with new Under-18 Principles that define how ChatGPT should support teens with safe, age-appropriate guidance grounded in developmental science. The update strengthens guardrails, clarifies expected model behavior in higher-risk situations, and builds on our broader work to improve teen safety across ChatGPT.
Addendum to GPT-5.2 System Card: GPT-5.2-Codex
This system card outlines the comprehensive safety measures implemented for GPT‑5.2-Codex. It details both model-level mitigations, such as specialized safety training for harmful tasks and prompt injections, and product-level mitigations like agent sandboxing and configurable network access.
The Open Evaluation Standard: Benchmarking NVIDIA Nemotron 3 Nano with NeMo Evaluator
Evaluating AI’s ability to perform scientific research tasks
OpenAI introduces FrontierScience, a benchmark testing AI reasoning in physics, chemistry, and biology to measure progress toward real scientific research.
OpenAI co-founds Agentic AI Foundation, donates AGENTS.md
OpenAI co-founds the Agentic AI Foundation under the Linux Foundation and donates AGENTS.md to support open, interoperable standards for safe agentic AI.
DeepMath: A lightweight math reasoning Agent with smolagents
Inside Mirakl's agentic commerce vision
Mirakl is redefining commerce through AI agents and ChatGPT Enterprise—achieving faster documentation, smarter customer support, and building toward agent-native commerce with Mirakl Nexus.
Strengthening our safety ecosystem with external testing
OpenAI works with independent experts to evaluate frontier AI systems. Third-party testing strengthens safety, validates safeguards, and increases transparency in how we assess model capabilities and risks.
How evals drive the next chapter in AI for businesses
Learn how evals help businesses define, measure, and improve AI performance—reducing risk, boosting productivity, and driving strategic advantage.
GPT-5.1-Codex-Max System Card
This system card outlines the comprehensive safety measures implemented for GPT‑5.1-Codex-Max. It details both model-level mitigations, such as specialized safety training for harmful tasks and prompt injections, and product-level mitigations like agent sandboxing and configurable network access.
How Scania is accelerating work with AI across its global workforce
Global manufacturer Scania is scaling AI with ChatGPT Enterprise. With team-based onboarding and strong guardrails, AI is boosting productivity, quality, and innovation.
How Philips is scaling AI literacy across 70,000 employees
Philips is scaling AI literacy with ChatGPT Enterprise, training 70,000 employees to use AI responsibly and improve healthcare outcomes worldwide.
GPT-5.1 Instant and GPT-5.1 Thinking System Card Addendum
This GPT-5 system card addendum provides updated safety metrics for GPT-5.1 Instant and Thinking, including new evaluations for mental health and emotional reliance.
Notion’s rebuild for agentic AI: How GPT‑5 helped unlock autonomous workflows
Discover how Notion rebuilt its AI architecture with GPT-5 to create autonomous agents that reason, act, and adapt across workflows. Learn how this shift unlocked smarter, faster, and more flexible productivity in Notion 3.0.
From Pilot to Practice: How BBVA Is Scaling AI Across the Organization
BBVA is reimagining how employees work with ChatGPT Enterprise, embedding AI into everyday operations. The bank has saved hours per week per employee, created 20,000+ Custom GPTs, and achieved up to 80% efficiency gains.
Introducing the Teen Safety Blueprint
Discover OpenAI’s Teen Safety Blueprint—a roadmap for building AI responsibly with safeguards, age-appropriate design, and collaboration to protect and empower young people online.
Introducing Aardvark: OpenAI’s agentic security researcher
OpenAI introduces Aardvark, an AI-powered security researcher that autonomously finds, validates, and helps fix software vulnerabilities at scale. The system is in private beta—sign up to join early testing.
Aligning to What? Rethinking Agent Generalization in MiniMax M2
How we built OWL, the new architecture behind our ChatGPT-based browser, Atlas
A deep dive into OWL, the new architecture powering ChatGPT Atlas—decoupling Chromium, enabling fast startup, rich UI, and agentic browsing with ChatGPT.
gpt-oss-safeguard technical report
gpt-oss-safeguard-120b and gpt-oss-safeguard-20b are two open-weight reasoning models post-trained from the gpt-oss models and trained to reason from a provided policy in order to label content under that policy. In this report, we describe gpt-oss-safeguard’s capabilities and provide safety evaluations, using the underlying gpt-oss models as a baseline. For more information about the development and architecture of the underlying gpt-oss models, see […]
Building a Healthcare Robot from Simulation to Deployment with NVIDIA Isaac
How to Build a Healthcare Robot from Simulation to Deployment with NVIDIA Isaac for Healthcare
Building the Open Agent Ecosystem Together: Introducing OpenEnv
Defining and evaluating political bias in LLMs
Learn how OpenAI evaluates political bias in ChatGPT through new real-world testing methods that improve objectivity and reduce bias.
Introducing AgentKit, new Evals, and RFT for agents
Today, we’re releasing new tools to help developers go from prototype to production faster: AgentKit, expanded evals capabilities, and reinforcement fine-tuning for agents.
Introducing RTEB: A New Standard for Retrieval Evaluation
Buy it in ChatGPT: Instant Checkout and the Agentic Commerce Protocol
We’re taking first steps toward agentic commerce in ChatGPT with new ways for people, AI agents, and businesses to shop together.
Accelerating Qwen3-8B Agent on Intel® Core™ Ultra with Depth-Pruned Draft Models
Measuring the performance of our models on real-world tasks
OpenAI introduces GDPval, a new evaluation that measures model performance on real-world economically valuable tasks across 44 occupations.
Smol2Operator: Post-Training GUI Agents for Computer Use
Gaia2 and ARE: Empowering the community to study agents
Democratizing AI Safety with RiskRubric.ai
Detecting and reducing scheming in AI models
Apollo Research and OpenAI developed evaluations for hidden misalignment (“scheming”) and found behaviors consistent with scheming in controlled tests across frontier models. The team shared concrete examples and stress tests of an early method to reduce scheming.
Teen safety, freedom, and privacy
Explore OpenAI’s approach to balancing teen safety, freedom, and privacy in AI use.
Addendum to GPT-5 system card: GPT-5-Codex
This addendum to the GPT-5 system card shares a new model: GPT-5-Codex, a version of GPT-5 further optimized for agentic coding in Codex. GPT-5-Codex adjusts its thinking effort more dynamically based on task complexity, responding quickly to simple conversational queries or small tasks, while independently working for longer on more complex tasks.
Jupyter Agents: training LLMs to reason with notebooks
Shipping smarter agents with every new model
Discover how SafetyKit leverages OpenAI GPT-5 to enhance content moderation, enforce compliance, and outpace legacy safety systems with greater accuracy.
Why language models hallucinate
OpenAI’s new research explains why language models hallucinate. The findings show how improved evaluations can enhance AI reliability, honesty, and safety.
Collective alignment: public input on our Model Spec
OpenAI surveyed over 1,000 people worldwide on how AI should behave and compared their views to our Model Spec. Learn how collective alignment is shaping AI defaults to better reflect diverse human values and perspectives.
OpenAI and Anthropic share findings from a joint safety evaluation
OpenAI and Anthropic share findings from a first-of-its-kind joint safety evaluation, testing each other’s models for misalignment, instruction following, hallucinations, jailbreaking, and more—highlighting progress, challenges, and the value of cross-lab collaboration.
Scaling domain expertise in complex, regulated domains
Discover how Blue J is transforming tax research with AI-powered tools built on GPT-4.1. By combining domain expertise with Retrieval-Augmented Generation, Blue J delivers fast, accurate, and fully-cited tax answers—trusted by professionals across the US, Canada, and the UK.
Scaling accounting capacity with OpenAI
Built with OpenAI o3, o3-Pro, GPT-4.1, and GPT-5, Basis’ AI agents help accounting firms save up to 30% of their time and expand capacity for advisory and growth.
From hard refusals to safe-completions: toward output-centric safety training
Discover how OpenAI's new safe-completions approach in GPT-5 improves both safety and helpfulness in AI responses—moving beyond hard refusals to nuanced, output-centric safety training for handling dual-use prompts.
Vision Language Model Alignment in TRL ⚡️
Introducing gpt-oss
We’re releasing gpt-oss-120b and gpt-oss-20b—two state-of-the-art open-weight language models that deliver strong real-world performance at low cost. Available under the flexible Apache 2.0 license, these models outperform similarly sized open models on reasoning tasks, demonstrate strong tool use capabilities, and are optimized for efficient deployment on consumer hardware.
Three lessons for creating a sustainable AI advantage
Discover how Intercom built a scalable AI platform with 3 key lessons—from evaluations to architecture—to lead the future of customer support.
Model ML is helping financial firms rebuild with AI from the ground up
As part of our Executive Function series, Model ML CEO Chaz Englander discusses how AI-native infrastructure and autonomous agents are transforming financial services workflows.
Introducing ChatGPT agent
Introducing ChatGPT agent: it thinks and acts, using tools to complete tasks like research, bookings, and slideshows—all with your guidance.
ChatGPT agent System Card
ChatGPT agent System Card: OpenAI’s agentic model unites research, browser automation, and code tools with safeguards under the Preparedness Framework.
Agent bio bug bounty call
OpenAI invites researchers to its Bio Bug Bounty. Test the ChatGPT agent’s safety with a universal jailbreak prompt and win up to $25,000.
Back to The Future: Evaluating AI Agents on Predicting Future Events
Asynchronous Robot Inference: Decoupling Action Prediction and Execution
ScreenEnv: Deploy your full stack Desktop Agent
Announcing NeurIPS 2025 E2LM Competition: Early Training Evaluation of Language Models
Toward understanding and preventing misalignment generalization
We study how training on incorrect responses can cause broader misalignment in language models and identify an internal feature driving this behavior—one that can be reversed with minimal fine-tuning.