Live subtitling quality: latency, accuracy and the NER model

Word error rate tells you almost nothing about whether a live subtitle is usable. A short guide to the metrics that actually matter on air.

Frans Olsthoorn
Founder & CEO, Scriptix

Live subtitling is one of the hardest places to deploy automatic speech recognition (ASR). The audience is unforgiving, the audio is messy, and the broadcaster carries a regulatory duty to deliver readable, well-timed captions. Word error rate (WER) alone is a poor predictor of whether a viewer can follow along, and a vendor that quotes only WER on a clean test set is telling you very little about how their system will behave on air.

Three numbers that matter

  • Latency. The delay between sound and subtitle. Anything above three seconds breaks the link with the speaker on screen. The 95th percentile latency matters more than the average, because the outliers are what viewers actually notice.
  • Reading speed. Words per minute on screen. Above 180 wpm the average viewer cannot keep up, and accessibility audiences fall behind much sooner. Good live systems condense rather than cram.
  • NER score. A weighted measure, used by broadcasters and regulators, that counts edition errors and recognition errors according to how badly each one changes the meaning, rather than treating every word swap alike.
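The first two numbers are easy to compute from a caption log. The sketch below is a minimal illustration, not a measurement standard: the function names, the nearest-rank percentile method and the sample latencies are all assumptions made for the example.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample that has at least
    pct percent of all samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))  # 1-based rank
    return ordered[max(rank, 1) - 1]

def words_per_minute(text, seconds_on_screen):
    """Reading speed of a single caption in words per minute."""
    if seconds_on_screen <= 0:
        raise ValueError("caption must be on screen for a positive duration")
    return len(text.split()) * 60 / seconds_on_screen

# Hypothetical sound-to-subtitle delays in seconds, one per caption.
latencies = [1.8, 2.1, 1.9, 2.4, 2.0, 2.2, 1.7, 2.3, 6.5, 2.1]

mean_latency = sum(latencies) / len(latencies)
p95_latency = percentile(latencies, 95)

# Hypothetical caption shown for two seconds.
caption_wpm = words_per_minute(
    "The minister did not announce new funding today", 2.0
)

print(f"mean latency: {mean_latency:.1f} s")
print(f"p95 latency:  {p95_latency:.1f} s")
print(f"caption speed: {caption_wpm:.0f} wpm")
```

With these made-up samples the mean sits comfortably under three seconds while the 95th percentile does not, which is the gap an average hides. The single caption also runs well above the 180 wpm ceiling even though it is only eight words long.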

Why the NER model is the honest one

The NER model takes its name from its three inputs: N, the number of words in the subtitles; E, the edition errors; and R, the recognition errors. Each error is weighted by its impact on meaning. A missed filler word counts for almost nothing. A wrong negation, a wrong number or a wrong name counts for a lot. A captioning service that posts a clean WER but loses every 'not' is failing its viewers; a service with a slightly higher WER but a clean NER score is doing its job. Regulators have noticed, and procurement specs increasingly call out NER explicitly.
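To make the weighting concrete, here is a small sketch of the score as it is commonly published: accuracy = (N − E − R) / N × 100, with each error weighted 0.25 (minor), 0.5 (standard) or 1.0 (serious). The function name and the error tallies are hypothetical, and a real assessment depends on a trained human evaluator deciding the severity of each error.

```python
# Severity weights as commonly published for the NER model.
WEIGHTS = {"minor": 0.25, "standard": 0.5, "serious": 1.0}

def ner_score(n_words, edition_errors, recognition_errors):
    """Return the NER accuracy percentage.

    n_words: number of words in the subtitles (N).
    edition_errors / recognition_errors: lists of severity labels,
    one label per error, each a key of WEIGHTS.
    """
    if n_words <= 0:
        raise ValueError("n_words must be positive")
    e = sum(WEIGHTS[s] for s in edition_errors)
    r = sum(WEIGHTS[s] for s in recognition_errors)
    return (n_words - e - r) / n_words * 100

# Two hypothetical 1000-word samples with the same number of errors.
# Sample A: twenty dropped filler words, all judged minor.
score_fillers = ner_score(1000, [], ["minor"] * 20)
# Sample B: twenty lost negations, all judged serious.
score_negations = ner_score(1000, [], ["serious"] * 20)

print(f"fillers dropped:   {score_fillers:.1f}")
print(f"negations dropped: {score_negations:.1f}")
```

Both samples contain the same number of wrong words, so a plain word count would rate them equally. The weighted score separates them: the second sample loses four times as many points for the same number of errors.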

Where live systems typically fail

After several years of live deployments, the failure modes cluster into a handful of patterns:

  • Flickering captions caused by an ASR engine that keeps rewriting partials without a stable commit policy.
  • Latency spikes during high-energy moments (applause, laughter, music) where the audio stack momentarily struggles.
  • Named entities drifting (politicians, places, sponsors) because the system has no domain vocabulary.
  • Reading speed creep when the speaker is fast and the system refuses to condense, pushing captions off screen before the viewer can read them.
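The first failure mode is the easiest to picture in code. One possible commit policy, sketched below, only puts a word on screen once it has appeared unchanged at the same position in the last few partial hypotheses, and never rewrites a word after that. This is an illustrative assumption, not a description of any particular engine: the class name, the `stable_after` parameter and the sample partials are all hypothetical.

```python
def stable_prefix_len(hypotheses):
    """Number of leading words shared by every hypothesis in the list."""
    if not hypotheses:
        return 0
    shortest = min(len(h) for h in hypotheses)
    first = hypotheses[0]
    for i in range(shortest):
        if any(h[i] != first[i] for h in hypotheses):
            return i
    return shortest

class CommitBuffer:
    """Commits a word once it is stable across `stable_after` partials."""

    def __init__(self, stable_after=2):
        self.stable_after = stable_after
        self.recent = []     # most recent partials, as word lists
        self.committed = []  # words already on screen, never rewritten

    def update(self, partial):
        """Feed one partial hypothesis; return the newly committed words."""
        words = partial.split()
        self.recent = (self.recent + [words])[-self.stable_after:]
        if len(self.recent) < self.stable_after:
            return []
        stable = stable_prefix_len(self.recent)
        # Empty slice if nothing new is stable yet.
        new_words = self.recent[-1][len(self.committed):stable]
        self.committed.extend(new_words)
        return new_words

# Hypothetical partials from an engine that briefly mishears "not".
partials = [
    "the minister",
    "the minister did",
    "the minister did note",
    "the minister did not announce",
    "the minister did not announce cuts",
]

buffer = CommitBuffer(stable_after=2)
emitted = [buffer.update(p) for p in partials]
print(emitted)
print(" ".join(buffer.committed))
```

In this trace the misheard "note" never reaches the screen, because it changes before it has been stable for two partials. The price is latency: every word waits at least one extra partial. A production policy would also need a rule for flushing the tail of an utterance and a maximum wait, so that long sentences do not stall.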

What to demand from a vendor

Ask for live latency at the 95th percentile, not the average. Ask for an NER measurement on a representative sample of your own programming, not a vendor-curated demo reel. Ask whether the system can be tuned to your shows, your speakers and your terminology, and whether the tuning happens inside your tenant. Ask what the commit policy is and how it handles long sentences. If the answer to any of those is vague, keep looking.

Live captioning is one of those domains where the difference between a usable system and an unusable one is invisible in the marketing material but very visible on screen. Test on your own content before you sign.

About the author

Frans Olsthoorn founded Scriptix in 2010 and has spent more than fifteen years shipping speech recognition into European broadcasters, courts and government bodies. He writes about ASR, accessibility regulation and the realities of running AI workloads inside customer infrastructure.