Which tokens does a hybrid model predict better?

Overview

A blog post by Ai2 on Hugging Face. Published June 25, 2026. Kyle Wiggers (Ai2Comms), allenai.
Tech report: https://arxiv. 20936

Which kinds of tokens does a model predict well, and which does it not?

Key Takeaways

That question is especially intriguing in the case of hybrids, a language model architecture that's begun to challenge the standard transformer and that we've been investigating with Olmo Hybrid.
Hybrids can match or beat transformers on standard benchmarks, but the headline numbers don't reveal much about what specific advantages hybrid models have over transformers.
Viewing these differences at the token level allows us to glean insights about the specific strengths of hybrid models over transformers.
Our results show that the hybrid's advantage is real across many tokens, but not all.
A transformer uses attention in every layer, while a hybrid replaces some of those attention layers with recurrent layers.
The model can draw directly on every earlier token at once, weighing how relevant each is to the current prediction.
Unlike an attention layer, a recurrent layer reads tokens left to right and carries a fixed-size memory. It folds each new token into that memory as it goes, so the cost of processing each token stays flat however long the input gets.
That memory is compressed and lossy, so a recurrent layer can't reach back for an exact earlier token the way attention can.
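The contrast between the two layer types can be sketched in a few lines of plain Python. This is a toy illustration only: `causal_attention_step`, `recurrent_step`, and the `decay` parameter are hypothetical names, and the exponential-moving-average update stands in for a recurrent layer in general, not for the specific layers used in Olmo Hybrid.

```python
import math

def causal_attention_step(query, keys, values):
    """Weigh every earlier token by relevance (softmax over dot products)."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    dim = len(values[0])
    # Work here grows with the number of earlier tokens.
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]

def recurrent_step(state, token_vec, decay=0.9):
    """Fold one token into a fixed-size memory (toy update, an assumption)."""
    # Work here is constant: it depends only on the state size.
    return [decay * s + (1.0 - decay) * x for s, x in zip(state, token_vec)]

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Attention: the latest token looks directly at every token so far.
attended = causal_attention_step(tokens[-1], tokens, tokens)

# Recurrence: one fixed-size state is updated token by token.
state = [0.0, 0.0]
for vec in tokens:
    state = recurrent_step(state, vec)
```

After the loop, `state` still has two entries no matter how many tokens were read, which is why the memory is lossy: earlier tokens survive only as a blended trace, whereas `attended` was computed from the exact earlier vectors.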
We recorded the probability each model gave to the token that actually followed.
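As a rough sketch of what recording that probability involves, the snippet below turns per-position scores over a vocabulary into the log-probability of the token that actually came next. The function name `next_token_log_probs`, the three-word vocabulary, and the logits are all made up for illustration; the article does not describe its evaluation code.

```python
import math

def next_token_log_probs(logits_per_position, token_ids):
    """Return log p(actual next token) for each position.

    logits_per_position[t] holds the scores over the vocabulary that are
    used to predict token_ids[t + 1].
    """
    log_probs = []
    for t in range(len(token_ids) - 1):
        logits = logits_per_position[t]
        top = max(logits)
        # log of the softmax normaliser, computed stably
        log_norm = top + math.log(sum(math.exp(x - top) for x in logits))
        target = token_ids[t + 1]
        log_probs.append(logits[target] - log_norm)
    return log_probs

# Toy input: a 3-word vocabulary and a 3-token sequence (invented values).
token_ids = [0, 2, 1]
logits = [
    [0.0, 0.0, 0.0],            # uniform scores before token 2
    [0.0, math.log(3.0), 0.0],  # token 1 is favoured before token 1
]
probs = [math.exp(lp) for lp in next_token_log_probs(logits, token_ids)]
```

Running the same function over two models' logits for the same text gives two aligned lists, one probability per token, which is the raw material for a token-level comparison.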

Stats & Key Facts

Which kinds of tokens does a model predict well, and which does it not?

That question is especially intriguing in the case of hybrids, a language model architecture that's begun to challenge the standard transformer and that we've been investigating with Olmo Hybrid. Hybrids can match or beat transformers on standard benchmarks, but the headline numbers don't reveal much about what specific advantages hybrid models have over transformers. To shed light on these token-level behaviors, we recently ran experiments comparing our strongest 7B transformer, Olmo 3, head-to-head with our hybrid model, Olmo Hybrid.

Specifically, we compare the two models' predictions in a fine-grained way across different types of tokens, the units of information that appear as input to an LLM. Because Olmo 3 and Olmo Hybrid were built to be as alike as possible outside their architectures (closely matched in data, tokenizer, and training recipe), any difference in their predictions mostly reflects the architecture itself. Viewing these differences at the token level allows us to glean insights about the specific strengths of hybrid models over transformers.
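One way to picture a comparison by token type is to average the per-token log-probability gap within each category. The sketch below assumes each token already has a category label and a log-probability from each model; `mean_gap_by_category`, the labels, and every number are placeholders invented for illustration, not results from the article.

```python
from collections import defaultdict

def mean_gap_by_category(log_probs_a, log_probs_b, categories):
    """Average (model A minus model B) log-probability per token category."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for lp_a, lp_b, cat in zip(log_probs_a, log_probs_b, categories):
        totals[cat] += lp_a - lp_b
        counts[cat] += 1
    return {cat: totals[cat] / counts[cat] for cat in totals}

# Placeholder inputs (invented values, not measurements).
model_a_lp = [-1.0, -2.0, -0.5, -0.25]
model_b_lp = [-1.5, -3.0, -0.5, -0.25]
cats = ["NOUN", "VERB", "DET", "DET"]

gaps = mean_gap_by_category(model_a_lp, model_b_lp, cats)
```

A positive gap for a category means model A assigned more probability to the correct tokens of that type; a gap near zero means the two models are effectively tied there.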

Our results show that the hybrid's advantage is real across many tokens, but not all. Olmo Hybrid is strongest on tokens that carry meaning, such as nouns, verbs, and adjectives, and on tokens that can only be predicted by following what's going on, like which person a pronoun refers to. But the hybrid's advantage almost disappears on tokens that simply repeat something already in the input (a word or phrase reproduced verbatim from earlier), where the answer is sitting right there to be looked up.
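To make "tokens that simply repeat something already in the input" concrete, here is one possible heuristic: flag a token when it completes an n-gram that already appeared earlier in the same sequence. This rule, the name `copied_token_flags`, and the choice of `n=3` are assumptions for illustration; the article does not say how it identifies repeated tokens.

```python
def copied_token_flags(tokens, n=3):
    """Flag position i when the n-gram ending at i occurred earlier."""
    seen = set()
    flags = [False] * len(tokens)
    for i in range(len(tokens)):
        if i + 1 >= n:
            gram = tuple(tokens[i - n + 1 : i + 1])
            if gram in seen:
                flags[i] = True  # this token completes a verbatim repeat
            seen.add(gram)
    return flags

words = "the red fox saw the red fox run".split()
flags = copied_token_flags(words, n=3)
```

Here only the second "fox" is flagged, because "the red fox" is the only three-word span that occurs twice. Tokens flagged this way are the ones a model could in principle predict by looking up the earlier occurrence.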

For more details please read the original article at Hugging Face.

Originally published by Hugging Face