Thinking Machines drops a new, highly responsive model designed for humanlike interactions in real time

Overview

Thinking Machines Lab, the AI research startup founded by former OpenAI CTO Mira Murati, announced a research preview of its first interaction models, a new class of multimodal AI designed for real-time, humanlike interaction. The models use a full-duplex architecture so the system can listen, see, and talk at the same time, processing inputs and outputs in 200-millisecond chunks. A dual-model design pairs a fast interaction model with a background model that handles heavier reasoning. The company reports lower turn-taking latency than comparable models on a benchmark called FD-bench.

Key Takeaways

Thinking Machines Lab, founded by Mira Murati, announced a research preview of its first interaction models.
The models use a full-duplex architecture that lets AI listen, see, and talk simultaneously.
The system processes inputs and outputs in 200-millisecond chunks to react in real time.
A dual-model design pairs TML-Interaction-Small with an asynchronous background model that handles complex reasoning, web searches, and tool calls.
On FD-bench, TML-Interaction-Small reached turn-taking latency under 0.4 seconds, ahead of Gemini-3.1-flash-live and GPT-realtime-2.0.

Stats & Key Facts

#Inputs and outputs processed in 200-millisecond chunks
#TML-Interaction-Small is a 276-billion-parameter mixture-of-experts model
#TML-Interaction-Small reached turn-taking latency of less than 0.4 seconds on FD-bench
#Gemini-3.1-flash-live clocked in at 0.57 seconds
#GPT-realtime-2.0 achieved 1.18 seconds

Moving Beyond Turn-Based AI

Thinking Machines wants to end the pauses in current AI interactions.

›In typical use, a user provides input and then waits before receiving output.
›This happens because existing models wait for users to finish before generating a response.
›The company says back-and-forth interactions force users to contort themselves to the interface.

Over months of use, people have learned to phrase questions like emails and batch their thoughts, because current AI cannot handle interruptions or the subtle backchanneling (the "mhmms" and "I sees") of natural conversation.

A Full-Duplex Architecture

The new architecture enables simultaneous listening, seeing, and speaking.

›Full-duplex communication means the AI can listen, see, and talk at the same time.
›The design replaces the standard alternating token sequence with a multistream, micro-turn-based approach.
›The system processes inputs and outputs in 200-millisecond chunks, reacting to visual or auditory cues even while speaking.
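To make the micro-turn idea concrete, here is a toy simulation. It is not Thinking Machines' implementation: `InputChunk`, `run_duplex`, the per-chunk `user_speaking` voice-activity flag, and the policy of abandoning a reply when the user barges in are all hypothetical. It only illustrates how checking the input stream every 200 ms lets a system react while it is mid-reply, where a turn-based loop would finish its whole answer first.

```python
from dataclasses import dataclass

CHUNK_MS = 200  # chunk duration reported for the interaction models


@dataclass
class InputChunk:
    t_ms: int            # start time of this 200 ms slice
    user_speaking: bool  # hypothetical voice-activity flag for the slice


def run_duplex(inputs, reply_chunks):
    """Toy micro-turn loop (hypothetical): every 200 ms tick the system looks
    at fresh input, so it can stop a reply in progress if the user cuts in."""
    pending = list(reply_chunks)
    heard_user = False
    timeline = []
    for chunk in inputs:
        if chunk.user_speaking:
            heard_user = True
            if timeline and timeline[-1][1].startswith("say:"):
                pending.clear()  # barge-in: drop the rest of the reply
                timeline.append((chunk.t_ms, "yield"))
            else:
                timeline.append((chunk.t_ms, "listen"))
        elif heard_user and pending:
            # user is quiet: emit the next 200 ms of the reply
            timeline.append((chunk.t_ms, "say:" + pending.pop(0)))
        else:
            timeline.append((chunk.t_ms, "idle"))
    return timeline


speech = [True, True, False, False, True, False]
inputs = [InputChunk(i * CHUNK_MS, s) for i, s in enumerate(speech)]
for t_ms, action in run_duplex(inputs, ["Sure,", "here", "is", "the"]):
    print(t_ms, action)
```

With the user speaking for two chunks, pausing for two, then interrupting, the timeline shows two listening ticks, two spoken chunks, and a yield at 800 ms instead of the full four-chunk reply.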

Instead of using heavy external encoders to translate audio or video, the model uses encoder-free early fusion: raw signals pass directly through a lightweight embedding layer, and everything is processed within the transformer to keep latency low.
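A minimal sketch of what encoder-free early fusion can look like, assuming 16 kHz mono audio, 10 ms patches, a toy 8-dimensional embedding width, random weights, and a 100-token text vocabulary. All of these values and function names are hypothetical; the article gives no such details, and video is omitted. The only component in front of the sequence is a linear projection for audio patches and a lookup table for text tokens, after which both modalities share one sequence.

```python
import random

SAMPLE_RATE = 16_000  # assumed audio sample rate in Hz (hypothetical)
CHUNK_MS = 200        # chunk length reported for the models
PATCH = 160           # assumed patch size: 10 ms of audio (hypothetical)
D_MODEL = 8           # toy embedding width (hypothetical)

rng = random.Random(0)
# Lightweight embedding layer: one linear projection for raw audio patches...
W_audio = [[rng.gauss(0, 0.02) for _ in range(PATCH)] for _ in range(D_MODEL)]
# ...and a lookup table for text tokens (toy vocabulary of 100 ids).
text_table = [[rng.gauss(0, 0.02) for _ in range(D_MODEL)] for _ in range(100)]


def embed_audio(samples):
    """Cut raw samples into fixed patches and project each one linearly.
    No pretrained audio encoder sits in front of this step."""
    patches = [samples[i:i + PATCH] for i in range(0, len(samples), PATCH)]
    return [[sum(w * x for w, x in zip(row, p)) for row in W_audio] for p in patches]


def embed_text(token_ids):
    return [text_table[t] for t in token_ids]


def fuse(audio_samples, token_ids):
    """Early fusion: both modalities become vectors of the same width and are
    concatenated into one sequence that a transformer would consume."""
    return embed_audio(audio_samples) + embed_text(token_ids)


samples_per_chunk = SAMPLE_RATE * CHUNK_MS // 1000
chunk = [rng.uniform(-1, 1) for _ in range(samples_per_chunk)]
seq = fuse(chunk, [5, 17, 42])
print(len(seq), len(seq[0]))  # 23 8
```

Under these assumptions one 200 ms chunk is 3,200 samples, which becomes 20 audio embeddings; adding three text tokens gives a 23-vector sequence.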

The Dual-Model Design

The architecture balances speed with deep reasoning.

›TML-Interaction-Small is a 276-billion-parameter mixture-of-experts model that manages dialogue, presence, and immediate follow-ups.
›An asynchronous background model handles complex reasoning, web searches, and tool calls.
›The background model sends its findings to the interaction model to be woven into the live chat.
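The hand-off between the two models can be sketched with Python's `asyncio`. This is an illustration under assumptions, not the company's design: `background_model`, `interaction_model`, the queue used to pass findings, and the timings are all hypothetical. The point it shows is that the fast model delegates heavy work to a concurrent task, keeps ticking instead of blocking, and weaves the finding in as soon as it arrives.

```python
import asyncio


async def background_model(query, findings):
    """Hypothetical stand-in for the slower model that does reasoning,
    web search, or tool calls, then reports back through a queue."""
    await asyncio.sleep(0.05)  # simulate slow work
    await findings.put(f"result for {query!r}")


async def interaction_model(query, tick_s=0.01, max_ticks=100):
    """Hypothetical stand-in for the fast model. It hands heavy work to a
    background task, keeps ticking instead of blocking, and folds the
    finding into the conversation once it arrives."""
    findings = asyncio.Queue()
    task = asyncio.create_task(background_model(query, findings))
    transcript = [f"Let me look into {query}."]
    ticks_waited = 0
    for _ in range(max_ticks):
        try:
            result = findings.get_nowait()
        except asyncio.QueueEmpty:
            # Not ready yet: a real system would keep handling live dialogue here.
            ticks_waited += 1
            await asyncio.sleep(tick_s)
            continue
        transcript.append(f"Here is what I found: {result}")
        break
    await task  # make sure the background task has finished
    return transcript, ticks_waited


transcript, ticks = asyncio.run(interaction_model("flight delays"))
print(transcript)
print(f"interaction loop stayed active for {ticks} ticks")
```

A queue keeps the two sides decoupled: the interaction loop never awaits the background work directly, so its per-tick responsiveness does not depend on how long the search or tool call takes.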

Benchmark Results

The company reports strong latency results on FD-bench.

›FD-bench is a benchmark designed to measure AI interaction quality.
›TML-Interaction-Small achieved turn-taking latency of less than 0.4 seconds.
›That is ahead of Google's Gemini-3.1-flash-live at 0.57 seconds and OpenAI's GPT-realtime-2.0 at 1.18 seconds.
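The reported figures allow a quick check of how large the gaps are. This sketch reuses only the three numbers above; because the TML figure is stated as "less than 0.4 seconds," it treats 0.4 s as an upper bound, so the results are minimum gaps rather than exact ones. The helper name `min_advantage` is hypothetical, and the sketch does not model how FD-bench itself measures turn-taking latency.

```python
# Reported FD-bench turn-taking latencies, in seconds.
# TML-Interaction-Small is reported as "less than 0.4 s", so 0.4 is an upper
# bound and every gap computed from it is a lower bound on the real gap.
TML_UPPER_BOUND_S = 0.4
REPORTED_S = {"Gemini-3.1-flash-live": 0.57, "GPT-realtime-2.0": 1.18}


def min_advantage(latency_s, tml_bound_s=TML_UPPER_BOUND_S):
    """Return (minimum gap in seconds, minimum latency ratio) versus TML."""
    return latency_s - tml_bound_s, latency_s / tml_bound_s


for name, latency in REPORTED_S.items():
    gap, ratio = min_advantage(latency)
    print(f"{name}: at least {gap:.2f} s slower, at least {ratio:.2f}x TML's latency")
```

Under that reading, Gemini-3.1-flash-live trails by at least 0.17 s and GPT-realtime-2.0 by at least 0.78 s, or roughly 1.4x and 2.95x the 0.4 s bound.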

Why It Matters for Enterprises

The biggest implications could be in enterprise applications.

›Models that see and react in real time could support high-stakes applications.
›The article cites medical surgery as an example of a high-stakes use.
›The goal is for faster, more natural interaction to make AI a true humanlike collaborator.

Frequently Asked Questions

Who founded Thinking Machines Lab?

It was founded by Mira Murati, the former chief technology officer of OpenAI.

What is a full-duplex interaction model?

It is an architecture that lets AI listen, see, and talk simultaneously, processing inputs and outputs in 200-millisecond chunks so it can react in real time even while speaking.

How does the dual-model design work?

A fast model, TML-Interaction-Small, manages dialogue and immediate follow-ups, while an asynchronous background model handles complex reasoning, web searches, and tool calls and sends findings into the live chat.

How fast is TML-Interaction-Small?

On FD-bench it achieved turn-taking latency of less than 0.4 seconds, ahead of Gemini-3.1-flash-live at 0.57 seconds and GPT-realtime-2.0 at 1.18 seconds.

Where could these models matter most?

The article says the most significant implications could be in enterprise applications, including high-stakes uses like medical surgery.

Thinking Machines' interaction models aim to replace turn-based AI with real-time, full-duplex systems that listen, see, and talk at once.