Hugging Face has unveiled Open Responses, an open inference standard designed to replace legacy chat completion formats. The initiative aims to support autonomous agents that reason, plan, and act over extended time horizons. Industry observers note this move could significantly alter how developers build complex AI applications.
Current inference workloads are dominated by agents rather than simple chatbots. Developers are shifting toward autonomous systems that require interfaces beyond turn-based conversations. The Chat Completion format remains the de facto standard despite falling short for these agentic use cases.
This mismatch between workflow requirements and entrenched interfaces motivates the need for an open inference standard. Open Responses builds on the direction OpenAI set with its Responses API, launched in March 2025. The new format extends and open-sources that API to make it more accessible for builders and routing providers.
Client requests to Open Responses are similar to the existing Responses API schema. Clients that already support the Responses API can migrate to Open Responses with relatively little effort. The main changes involve how reasoning content is exposed to the client application.
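To make the schema overlap concrete, here is a minimal sketch of a request body in the Responses API style that Open Responses inherits. The model identifier and input content are hypothetical placeholders, and any field beyond the basic `model`/`input` pair should be checked against the actual specification.

```python
import json

# Illustrative Open Responses request, assuming the same basic shape as
# the OpenAI Responses API schema it extends. The model name below is a
# made-up placeholder, not a real endpoint.
request = {
    "model": "example-org/example-model",
    "input": [
        {"role": "user", "content": "Summarize the latest build logs."}
    ],
    "stream": False,
}

body = json.dumps(request)
print(body)
```

Because the top-level shape matches the existing Responses API, a client that already serializes requests like this would mostly need to change its endpoint and reasoning-handling code, not its payload construction.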
In previous iterations, OpenAI models exposed only reasoning summaries and encrypted reasoning content. With Open Responses, providers may expose their raw reasoning via the API for greater transparency. Clients migrating from providers that returned only summaries will now receive raw reasoning streams where supported.
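The difference is easiest to see in the shape of a reasoning item. The sketch below contrasts a summary-only item with one carrying raw reasoning text; the field names (`summary`, `content`, `reasoning_text`) follow the Responses API convention but are assumptions here, since the exact Open Responses field layout is not quoted in the announcement.

```python
# Hypothetical reasoning items: one from a provider exposing only a
# summary, one from a provider that also returns raw reasoning text.
summary_only_item = {
    "type": "reasoning",
    "summary": [{"type": "summary_text", "text": "Weighed two options."}],
}

raw_reasoning_item = {
    "type": "reasoning",
    "summary": [{"type": "summary_text", "text": "Weighed two options."}],
    "content": [{"type": "reasoning_text", "text": "Option A is cheaper because..."}],
}

def has_raw_reasoning(item: dict) -> bool:
    """Return True if the provider exposed raw reasoning text."""
    return bool(item.get("content"))

print(has_raw_reasoning(summary_only_item), has_raw_reasoning(raw_reasoning_item))
```

A migrating client would want a check like `has_raw_reasoning` so it can degrade gracefully when an upstream provider only supplies summaries.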
For Model Providers, implementing the changes for Open Responses should be straightforward if they already adhere to the Responses API specification. Routers now have the opportunity to standardize on a consistent endpoint for orchestration. Over time, as Providers continue to innovate, certain features will become standardized in the base specification.
Open Responses distinguishes between Model Providers and Routers, which orchestrate across multiple providers. Clients can now specify a Provider along with provider-specific API options when making requests. This allows intermediary Routers to manage requests between upstream providers efficiently.
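As a sketch of the idea, a routed request might carry a provider hint that the Router uses to pick an upstream endpoint. The `provider` field and its sub-keys below are hypothetical illustrations of the concept, not names confirmed by the specification.

```python
# Hypothetical request through a Router, naming an upstream Provider plus
# a provider-specific option. All provider field names are illustrative.
routed_request = {
    "model": "example-model",
    "input": "Plan a three-step refactor.",
    "provider": {
        "name": "example-provider",          # which upstream provider to use
        "options": {"quantization": "fp8"},  # provider-specific knob (made up)
    },
}

def pick_upstream(request: dict, default: str = "fallback-provider") -> str:
    """A Router might use the provider hint to select an upstream endpoint."""
    return request.get("provider", {}).get("name", default)

print(pick_upstream(routed_request))
```

The design point is that provider selection lives in the request itself, so a Router can dispatch without out-of-band configuration per client.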
The format formalizes the agentic loop, typically a repeating cycle of reasoning and tool invocation. Internally hosted tools allow the provider to manage the entire loop without developer intervention. Clients control loop behavior via max_tool_calls to cap iterations and tool_choice to constrain which tools are invocable. The response contains all intermediate items, including tool calls, results, and reasoning data.
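The loop a provider runs when hosting tools can be sketched as follows: alternate model steps and tool executions until the model stops calling tools or the max_tool_calls cap is hit, accumulating every intermediate item for the final response. The item shapes mirror Responses API conventions (`function_call` / `function_call_output`), but this is a simplified illustration of the pattern, not the spec's reference implementation.

```python
# Simplified provider-side tool loop: run the model, execute any tool it
# calls, feed the result back, and stop on a final answer or the cap.
def run_tool_loop(model_step, tools, max_tool_calls=5):
    """model_step(items) -> either a tool-call dict or a final message dict."""
    items = []
    for _ in range(max_tool_calls):
        step = model_step(items)
        items.append(step)
        if step["type"] != "function_call":
            return items  # model produced a final answer
        result = tools[step["name"]](**step["arguments"])
        items.append({"type": "function_call_output", "output": result})
    return items  # cap reached; intermediate items are still returned

# Toy model: calls the tool once, then answers using the tool's output.
def toy_model(items):
    if not items:
        return {"type": "function_call", "name": "add", "arguments": {"a": 2, "b": 3}}
    return {"type": "message", "content": f"The sum is {items[-1]['output']}."}

items = run_tool_loop(toy_model, {"add": lambda a, b: a + b})
print([it["type"] for it in items])
# -> ['function_call', 'function_call_output', 'message']
```

Note that the full item list, tool calls, tool outputs, and the final message, is returned to the client, which matches the article's point that responses expose all intermediate items rather than just the final text.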
Migrating to Open Responses promises a more consistent, higher-quality inference experience: undocumented extensions and workarounds that accreted around the legacy Completions API are normalized in the new specification. The team looks forward to working with the community at large on future development of the specification.
This development suggests that next steps for local LLM endpoint providers include supporting hosted tools. The industry can expect to see much more of this pattern, especially Agents offloading work to sub-agent Tool Loops via Open Responses. The sector is watching closely to see whether this standardizes the agentic inference landscape going forward.